study SCALAR
study
This function takes extra time to study SCALAR ($_ if
unspecified) in anticipation of doing many pattern matches on the
string before it is next modified. This may or may not save time,
depending on the nature and number of patterns you are searching on,
and on the distribution of character frequencies in the string to be
searched - you probably want to compare run-times with and without it to
see which runs faster. Those loops that scan for many short constant
strings (including the constant parts of more complex patterns) will
benefit most. If all your pattern matches are constant strings,
anchored at the front, study won't help at all, because no
scanning is done. You may have only one study active at a time - if
you study a different scalar the first is "unstudied".
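One way to make the suggested run-time comparison is with the core Benchmark module. The following is only a minimal sketch: the test string, the pattern list, and the helper name count_matches are made up for illustration, and real results depend entirely on your own data and Perl version (in recent Perl versions study may do nothing at all, in which case the two timings should be indistinguishable).

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical test data: a long string of repeated English text.
my $text     = join ' ', ('the quick brown fox jumps over the lazy dog') x 2_000;
my @patterns = (qr/\bfoo\b/, qr/\bbar\b/, qr/\bblurfl\b/);

# Run every pattern against a fresh copy of the text, optionally
# studying the copy first, and return how many patterns matched.
sub count_matches {
    my ($use_study) = @_;
    my $copy = $text;               # fresh scalar for each run
    study $copy if $use_study;
    my $hits = 0;
    for my $re (@patterns) {
        $hits++ if $copy =~ $re;
    }
    return $hits;
}

# Time both variants for the same number of iterations.
timethese(200, {
    with_study    => sub { count_matches(1) },
    without_study => sub { count_matches(0) },
});
```

Both variants must find the same matches; only the elapsed time should differ, if it differs at all.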
The way study works is this: a linked list of every character in the
string to be searched is made, so we know, for example, where all the
"k" characters are. From each search string, the rarest character is
selected, based on some static frequency tables constructed from some
C programs and English text. Only those places that contain this
rarest character are examined.
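The idea can be illustrated in pure Perl. This is a toy sketch of the technique as described, not the interpreter's actual implementation: it uses a hash of position arrays instead of a linked list, and the %freq table, its values, and the helper names build_index and find_all are all invented for the example.

```perl
use strict;
use warnings;

# Record every position at which each character occurs in the string.
sub build_index {
    my ($string) = @_;
    my %positions;
    for my $i (0 .. length($string) - 1) {
        push @{ $positions{ substr($string, $i, 1) } }, $i;
    }
    return \%positions;
}

# Hypothetical static frequency table (lower means rarer).
# Characters not listed get a middling default of 5.
my %freq = (
    e => 12.7, t => 9.1, a => 8.2, o => 7.5,
    r => 6.0,  f => 2.2, b => 1.5, k => 0.8,
);

# Find all offsets of a constant string, examining only the places
# where its rarest character occurs.
sub find_all {
    my ($string, $index, $needle) = @_;
    return () if $needle eq '';

    # Pick the rarest character of the needle and remember its offset.
    my ($best_off, $best_freq);
    for my $off (0 .. length($needle) - 1) {
        my $f = $freq{ substr($needle, $off, 1) } // 5;
        if (!defined $best_freq || $f < $best_freq) {
            ($best_off, $best_freq) = ($off, $f);
        }
    }
    my $rare = substr($needle, $best_off, 1);

    # Verify the full needle at each candidate position only.
    my @found;
    for my $pos (@{ $index->{$rare} || [] }) {
        my $start = $pos - $best_off;
        next if $start < 0;
        push @found, $start
            if substr($string, $start, length $needle) eq $needle;
    }
    return @found;
}

my $text  = "a fool and his food at the foo bar";
my $index = build_index($text);
print join(",", find_all($text, $index, "foo")), "\n";   # prints 2,15,27
```

Building the index costs one pass over the whole string, which is why the technique only pays off when many searches are run against the same unmodified string.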
For example, here is a loop that inserts index-producing entries
before any line containing a certain pattern:
    while (<>) {
        study;
        print ".IX foo\n" if /\bfoo\b/;
        print ".IX bar\n" if /\bbar\b/;
        print ".IX blurfl\n" if /\bblurfl\b/;
        ...
        print;
    }
In searching for /\bfoo\b/, only those locations in $_ that contain
"f" will be looked at, because "f" is rarer than "o". In
general, this is a big win except in pathological cases. The only question is
whether it saves you more time than it took to build the linked list in the
first place.
If you have to look for strings that you don't know until run-time, you can
build an entire loop as a string and eval that to avoid recompiling all
your patterns all the time. Together with setting $/ to input entire
files as one record, this can be very fast, often faster than specialized
programs like fgrep. The following scans a list of files (@files) for a
list of words (@words), and prints out the names of those files that
contain a match:
    $search = 'while (<>) { study;';
    foreach $word (@words) {
        $search .= "++\$seen{\$ARGV} if /\\b$word\\b/;\n";
    }
    $search .= "}";
    @ARGV = @files;
    undef $/;               # slurp each entire file
    eval $search;           # this screams
    die $@ if $@;           # in case eval failed
    $/ = "\n";              # put back to normal input delim
    foreach $file (sort keys(%seen)) {
        print $file, "\n";
    }
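For comparison, here is a sketch of the same file scan written without a string eval. It is an alternative technique, not the approach shown above: each word is compiled once with qr//, and quotemeta is applied on the assumption that the words are meant as literal strings rather than patterns. The function name files_matching and the sample word list are hypothetical.

```perl
use strict;
use warnings;

# Return the sorted names of the files that contain at least one word.
sub files_matching {
    my ($files, $words) = @_;

    # Compile each word once; quotemeta keeps metacharacters literal.
    my @patterns = map { my $w = quotemeta; qr/\b$w\b/ } @$words;

    my %seen;
    for my $file (@$files) {
        open my $fh, '<', $file
            or do { warn "Can't open $file: $!\n"; next };
        my $content = do { local $/; <$fh> };   # slurp the entire file
        close $fh;
        next unless defined $content;

        study $content;
        for my $re (@patterns) {
            if ($content =~ $re) {
                $seen{$file}++;
                last;               # one match is enough to list the file
            }
        }
    }
    my @names = sort keys %seen;
    return @names;
}

# Example use: file names come from the command line, words are fixed.
print "$_\n" for files_matching(\@ARGV, [qw(foo bar)]);
```

Because $/ is localized inside the do block, there is no need to restore the input delimiter afterwards.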