Efficiency (Programming Perl)

24.2. Efficiency

While most of the work of programming may be simply getting your program working properly, you may find yourself wanting more bang for the buck out of your Perl program. Perl's rich set of operators, data types, and control constructs is not necessarily intuitive when it comes to speed and space optimization. Many trade-offs were made during Perl's design, and such decisions are buried in the guts of the code. In general, the shorter and simpler your code is, the faster it runs, but there are exceptions. This section attempts to help you make it work just a wee bit better.

If you want it to work a lot better, you can play with the Perl compiler backend described in Chapter 18, "Compiling", or rewrite your inner loop as a C extension as illustrated in Chapter 21, "Internals and Externals".

Note that optimizing for time may sometimes cost you in space or programmer efficiency (indicated by conflicting hints below). Them's the breaks. If programming was easy, they wouldn't need something as complicated as a human being to do it, now would they?

24.2.1. Time Efficiency

Use hashes instead of linear searches. For example, instead of searching through @keywords to see if $_ is a keyword, construct a hash with:
```
my %keywords;
for (@keywords) {
    $keywords{$_}++;
}
```
Then you can quickly tell if $_ contains a keyword by testing $keywords{$_} for a nonzero value.
Avoid subscripting when a foreach or list operator will do. Not only is subscripting an extra operation, but if your subscript variable happens to be in floating point because you did arithmetic, an extra conversion from floating point back to integer is necessary. There's often a better way to do it. Consider using foreach, shift, and splice operations. Consider saying use integer.
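As a small illustration of the previous point, here is a sketch that sums an array both ways. The array name and its contents are made up for the example; the two loops compute the same thing, but the foreach version never touches a subscript:

```perl
use strict;
use warnings;

my @prices = (3, 5, 8);    # hypothetical data

# Subscripted loop: an index variable plus an element fetch per iteration.
my $indexed_total = 0;
for (my $i = 0; $i < @prices; $i++) {
    $indexed_total += $prices[$i];
}

# foreach aliases each element directly, so no subscript is needed.
my $total = 0;
foreach my $price (@prices) {
    $total += $price;
}
```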
Avoid goto. It scans outward from your current location for the indicated label.
Avoid printf when print will do.
Avoid $& and its two buddies, $` and $'. Any occurrence in your program causes all matches to save the searched string for possible future reference. (However, once you've blown it, it doesn't hurt to have more of them.)
Avoid using eval on a string. An eval of a string (although not of a BLOCK) forces recompilation every time through. The Perl parser is pretty fast for a parser, but that's not saying much. Nowadays there's almost always a better way to do what you want anyway. In particular, any code that uses eval merely to construct variable names is obsolete since you can now do the same directly using symbolic references:
```
no strict 'refs';
$name = "variable";
$$name = 7;           # Sets $variable to 7
```
Avoid eval STRING inside a loop. Put the loop into the eval instead, to avoid redundant recompilations of the code. See the study operator in Chapter 29, "Functions" for an example of this.
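A minimal sketch of putting the loop inside the eval, assuming an operator that is only known at run time. The variable names are invented for the example, and $op must come from a trusted source, since it is compiled as code:

```perl
use strict;
use warnings;

my $op    = "+";         # hypothetical operator chosen at run time (trusted input only)
my @nums  = (1 .. 5);
my $total = 0;

# One eval STRING compiles the whole loop once, instead of
# recompiling a one-line expression on every iteration.
my $source = 'for my $n (@nums) { $total = $total ' . $op . ' $n } 1;';
eval $source or die "compile failed: $@";
```

The eval can see $total and @nums because a string eval is compiled in the lexical scope where it runs.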
Avoid run-time-compiled patterns. Use the /pattern/o (once only) pattern modifier to avoid pattern recompilation when the pattern doesn't change over the life of the process. For patterns that change occasionally, you can use the fact that a null pattern refers back to the previous pattern, like this:
```
"foundstring" =~ /$currentpattern/;        # Dummy match (must succeed).
while (<>) {
    print if //;
}
```
Alternatively, you can precompile your regular expression using the qr quote construct. You can also use eval to recompile a subroutine that does the match (if you only recompile occasionally). That works even better if you compile a bunch of matches into a single subroutine, thus amortizing the subroutine call overhead.
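Here is one way the qr approach might look. The patterns and input lines are invented for the example; the point is only that each pattern is compiled once, up front, and then reused for every line:

```perl
use strict;
use warnings;

# Hypothetical patterns and input, for illustration only.
my @wanted = map { qr/$_/ } ('^ERROR', 'timeout\b');    # each compiled once
my @lines  = ("ERROR: disk full", "ok", "read timeout", "timeouts ignored");

my @hits;
for my $line (@lines) {
    for my $re (@wanted) {
        if ($line =~ $re) {      # reuses the already compiled pattern
            push @hits, $line;
            last;                # one hit per line is enough
        }
    }
}
```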
Short-circuit alternation is often faster than the corresponding regex. So:
```
print if /one-hump/ || /two/;
```
is likely to be faster than:
```
print if /one-hump|two/;
```
at least for certain values of one-hump and two. This is because the optimizer likes to hoist certain simple matching operations up into higher parts of the syntax tree and do very fast matching with a Boyer-Moore algorithm. A complicated pattern tends to defeat this.
Reject common cases early with next if. As with simple regular expressions, the optimizer likes this. And it just makes sense to avoid unnecessary work. You can typically discard comment lines and blank lines even before you do a split or chop:
```
while (<>) {
    next if /^#/;
    next if /^$/;
    chop;
    @piggies = split(/,/);
    ...
}
```
Avoid regular expressions with many quantifiers or with big {MIN,MAX} numbers on parenthesized expressions. Such patterns can result in exponentially slow backtracking behavior unless the quantified subpatterns match on their first "pass". You can also use the (?>...) construct to force a subpattern to either match completely or fail without backtracking.
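To see what (?>...) actually does, consider this deliberately tiny sketch. It shows the semantics rather than the speedup: the ordinary quantifier backs off to let the rest of the pattern match, while the atomic group refuses to give anything back.

```perl
use strict;
use warnings;

my $text = "aaaa";

# Ordinary a+ gives back one "a" so that the trailing "a" can match.
my $backtracking = ($text =~ /^a+a$/) ? 1 : 0;

# (?>a+) keeps all four and never gives any back, so the match fails at once.
my $atomic = ($text =~ /^(?>a+)a$/) ? 1 : 0;
```

On a pattern that would otherwise try exponentially many ways to split up the input, failing at once is exactly what you want.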
Try to maximize the length of any nonoptional literal strings in regular expressions. This is counterintuitive, but longer patterns often match faster than shorter patterns. That's because the optimizer looks for constant strings and hands them off to a Boyer-Moore search, which benefits from longer strings. Compile your pattern with Perl's -Dr debugging switch to see what Dr. Perl thinks the longest literal string is.
Avoid expensive subroutine calls in tight loops. There is overhead associated with calling subroutines, especially when you pass lengthy parameter lists or return lengthy values. In order of increasing desperation, try passing values by reference, passing values as dynamically scoped globals, inlining the subroutine, or rewriting the whole loop in C. (Better than all of those solutions is if you can define the subroutine out of existence by using a smarter algorithm.)
Avoid getc for anything but single-character terminal I/O. In fact, don't use it for that either. Use sysread.
Avoid frequent substrs on long strings, especially if the string contains UTF-8. It's okay to use substr at the front of a string, and for some tasks you can keep the substr at the front by "chewing up" the string as you go with a four-argument substr, replacing the part you grabbed with "":
```
while (length $buffer) {    # test the length, so a leftover "0" still gets processed
    process(substr($buffer, 0, 10, ""));
}
```
Use pack and unpack instead of multiple substr invocations.
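For instance, a fixed-width record can be taken apart in one unpack call rather than three substr calls. The record layout here (8-character date, 10-character name, 4-digit quantity) is an assumption made up for the sketch:

```perl
use strict;
use warnings;

# Hypothetical fixed-width record: 8-char date, 10-char name, 4-digit quantity.
my $record = sprintf("%-8s%-10s%04d", "20240131", "Widget", 42);

# Three separate substr calls...
my $date_s = substr($record, 0, 8);
my $name_s = substr($record, 8, 10);
my $qty_s  = substr($record, 18, 4);

# ...or one unpack. The "A" format also strips trailing spaces.
my ($date, $name, $qty) = unpack("A8 A10 A4", $record);
```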
Use substr as an lvalue rather than concatenating substrings. For example, to replace the fourth through seventh characters of $foo with the contents of the variable $bar, don't do this:
```
$foo = substr($foo,0,3) . $bar . substr($foo,7);
```
Instead, simply identify the part of the string to be replaced and assign into it, as in:
```
substr($foo, 3, 4) = $bar;
```
But be aware that if $foo is a huge string and $bar isn't exactly the length of the "hole", this can do a lot of copying too. Perl tries to minimize that by copying from either the front or the back, but there's only so much it can do if the substr is in the middle.
Use s/// rather than concatenating substrings. This is especially true if you can replace one constant with another of the same size. This results in an in-place substitution.
Use statement modifiers and equivalent and and or operators instead of full-blown conditionals. Statement modifiers (like $ring = 0 unless $engaged) and logical operators avoid the overhead of entering and leaving a block. They can often be more readable too.
Use $foo = $a || $b || $c. This is much faster (and shorter to say) than:
```
if ($a) {
    $foo = $a;
}
elsif ($b) {
    $foo = $b;
}
elsif ($c) {
    $foo = $c;
}
```
Similarly, set default values with:
```
$pi ||= 3;
```
Group together any tests that want the same initial string. When testing a string for various prefixes in anything resembling a switch structure, put together all the /^a/ patterns, all the /^b/ patterns, and so on.
Don't test things you know won't match. Use last or elsif to avoid falling through to the next case in your switch statement.
Use special operators like study, logical string operations, pack 'u', and unpack '%' formats.
Beware of the tail wagging the dog. Misstatements resembling (<STDIN>)[0] can cause Perl much unnecessary work. In accordance with Unix philosophy, Perl gives you enough rope to hang yourself.
Factor operations out of loops. The Perl optimizer does not attempt to remove invariant code from loops. It expects you to exercise some sense.
Strings can be faster than arrays.
Arrays can be faster than strings. It all depends on whether you're going to reuse the strings or arrays and which operations you're going to perform. Heavy modification of each element implies that arrays will be better, and occasional modification of some elements implies that strings will be better. But you just have to try it and see.
my variables are faster than local variables.
Sorting on a manufactured key array may be faster than using a fancy sort subroutine. A given array value will usually be compared multiple times, so if the sort subroutine has to do much recalculation, it's better to factor out that calculation to a separate pass before the actual sort.
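A sketch of the manufactured key array, assuming a case-insensitive sort as the "fancy" comparison. The names are invented. Each key is computed once in a separate pass, and the sort itself only compares precomputed values by index:

```perl
use strict;
use warnings;

my @names = ("delta", "Alpha", "charlie", "Bravo");    # hypothetical data

# One pass computes each key exactly once.
my @keys = map { lc($_) } @names;

# Sort the indices by key, then pull the names out in that order.
my @order  = sort { $keys[$a] cmp $keys[$b] } 0 .. $#names;
my @sorted = @names[@order];
```

With a key as cheap as lc the gain is small; the technique pays off when the key takes real work to compute.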
If you're deleting characters, tr/abc//d is faster than s/[abc]//g.
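The two forms side by side, on a made-up string. As a bonus, tr returns the number of characters it deleted:

```perl
use strict;
use warnings;

my $with_tr = "a1b2c3abc";                  # hypothetical input
(my $with_s = $with_tr) =~ s/[abc]//g;      # regex version, on a copy

my $deleted = ($with_tr =~ tr/abc//d);      # tr version, in place
```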
print with a comma separator may be faster than concatenating strings. For example:
```
print $fullname{$name} . " has a new home directory " .
    $home{$name} . "\n";
```
has to glue together the two hash values and the two fixed strings before passing them to the low-level print routines, whereas:
```
print $fullname{$name}, " has a new home directory ",
    $home{$name}, "\n";
```
doesn't. On the other hand, depending on the values and the architecture, the concatenation may be faster. Try it.
Prefer join("", ...) to a series of concatenated strings. Multiple concatenations may cause strings to be copied back and forth multiple times. The join operator avoids this.
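One common shape for this, sketched with invented data: collect the pieces in an array as you go, then glue them together once at the end.

```perl
use strict;
use warnings;

my @chunks;
push @chunks, "line $_\n" for 1 .. 3;    # collect the pieces as you go

my $text = join("", @chunks);            # glue them together in one operation
```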
split on a fixed string is generally faster than split on a pattern. That is, use split(/ /, ...) rather than split(/ +/, ...) if you know there will only be one space. However, the patterns /\s+/, /^/, and / / are specially optimized, as is the special split on whitespace.
Pre-extending an array or string can save some time. As strings and arrays grow, Perl extends them by allocating a new copy with some room for growth and copying in the old value. Pre-extending a string with the x operator or an array by setting $#array can prevent this occasional overhead and reduce memory fragmentation.
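A minimal sketch of both kinds of pre-extension. The sizes are arbitrary, and whether it helps measurably depends on your data, so treat it as something to benchmark rather than a guarantee:

```perl
use strict;
use warnings;

my $n = 10_000;                     # arbitrary size for the example

my @squares;
$#squares = $n - 1;                 # pre-extend to $n slots, all undef
$squares[$_] = $_ * $_ for 0 .. $n - 1;

my $buffer = "\0" x 8192;           # pre-extend a string to 8192 bytes
substr($buffer, 0, 5) = "hello";    # same-length overwrite, no regrowth needed
```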
Don't undef long strings and arrays if they'll be reused for the same purpose. This helps prevent reallocation when the string or array must be re-extended.
Prefer "\0" x 8192 over pack("x8192").
system("mkdir ...") may be faster on multiple directories if the mkdir syscall isn't available.
Avoid using eof if return values will already indicate it.
Cache entries from files (like passwd and group files) that are apt to be reused. It's particularly important to cache entries from the network. For example, to cache the return value from gethostbyaddr when you are converting numeric addresses (like 204.148.40.9) to names (like "www.oreilly.com"), you can use something like:
```
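use Socket qw(AF_INET);

my %numtoname;    # cache: dotted-quad address => hostname

sub numtoname {
    local ($_) = @_;
    unless (defined $numtoname{$_}) {
        # Pack the dotted quad into four bytes and look it up only once.
        my (@a) = gethostbyaddr(pack('C4', split(/\./)), AF_INET);
        $numtoname{$_} = @a > 0 ? $a[0] : $_;
    }
    return $numtoname{$_};
}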
```
Avoid unnecessary syscalls. Operating system calls tend to be rather expensive. So for example, don't call the time operator when a cached value of $now would do. Use the special _ filehandle to avoid unnecessary stat(2) calls. On some systems, even a minimal syscall may execute a thousand instructions.
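The _ filehandle trick might look like this. The temporary file exists only to give the example something to test; the point is that the first file test does the stat(2) and the later ones reuse its result:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file, created only so the example has something to stat.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "hello";
close($fh) or die "$0: can't close $path: $!\n";

my ($is_file, $size) = (0, 0);
if (-e $path) {                  # this test performs the stat(2) call
    $is_file = (-f _) ? 1 : 0;   # reuses the saved stat buffer
    $size    = -s _;             # so does this one
}
```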
Avoid unnecessary system calls. The system function has to fork a subprocess in order to execute the program you specify--or worse, execute a shell to execute the program. This can easily execute a million instructions.
Worry about starting subprocesses, but only if they're frequent. Starting a single pwd, hostname, or find process isn't going to hurt you much--after all, a shell starts subprocesses all day long. We do occasionally encourage the toolbox approach, believe it or not.
Keep track of your working directory yourself rather than calling pwd repeatedly. (A standard module is provided for this. See Cwd in Chapter 30, "The Standard Perl Library".)
Avoid shell metacharacters in commands--pass lists to system and exec where appropriate.
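A sketch of the list form of system. To keep it self-contained, the program being run is just the current perl ($^X), and the argument is a made-up string stuffed with metacharacters. Because no shell is involved, the child sees it as a single, unmolested argument:

```perl
use strict;
use warnings;

# A hypothetical argument full of shell metacharacters.
my $arg = 'a b; echo $HOME *';

# List form: no shell is started, so $arg reaches the child as one argument.
# The child exits 0 only if it received exactly one argument.
my $status = system($^X, "-e", 'exit(@ARGV == 1 ? 0 : 1)', $arg);
```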
Set the sticky bit on the Perl interpreter on machines without demand paging:
```
chmod +t /usr/bin/perl
```
Allowing built-in functions' arguments to default to $_ doesn't make your program faster.

24.2.2. Space Efficiency

You can use vec for compact integer array storage if the integers are of fixed width. (Integers of variable width can be stored in a UTF-8 string.)
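For example, five small integers can be stored in 4 bits apiece. The values are invented, and each must fit in the chosen width (0 to 15 here):

```perl
use strict;
use warnings;

my @values = (3, 0, 15, 7, 9);    # hypothetical data, each fits in 4 bits
my $packed = "";

# Store each value in its own 4-bit slot.
vec($packed, $_, 4) = $values[$_] for 0 .. $#values;

# Read them back out by slot number.
my @back = map { vec($packed, $_, 4) } 0 .. $#values;

my $bytes = length($packed);      # five nibbles round up to three bytes
```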
Prefer numeric values over equivalent string values--they require less memory.
Use substr to store constant-length strings in a longer string.
Use the Tie::SubstrHash module for very compact storage of a hash array, if the key and value lengths are fixed.
Use __END__ and the DATA filehandle to avoid storing program data as both a string and an array.
Prefer each to keys where order doesn't matter.
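A sketch of the difference, with an invented hash. keys has to build the whole list of keys before the loop starts, while each hands back one pair at a time:

```perl
use strict;
use warnings;

my %count = (apple => 3, pear => 1, fig => 7);    # hypothetical data

my $total = 0;
while (my ($fruit, $n) = each %count) {    # one key/value pair per iteration
    $total += $n;
}
```

On a three-element hash nobody will notice; on a hash tied to a large DBM file, the list of keys alone can be enormous.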
Delete or undef globals that are no longer in use.
Use some kind of DBM to store hashes.
Use temp files to store arrays.
Use pipes to offload processing to other tools.
Avoid list operations and entire file slurps.
Avoid using tr///. Each tr/// expression must store a sizable translation table.
Don't unroll your loops or inline your subroutines.

24.2.3. Programmer Efficiency

Use defaults.
Use funky shortcut command-line switches like -a, -n, -p, -s, and -i.
Use for to mean foreach.
Run system commands with backticks.
Use <*> and such.
Use patterns created at run time.
Use *, +, and {} liberally in your patterns.
Process whole arrays and slurp entire files.
Use getc.
Use $`, $&, and $'.
Don't check error values on open, since <HANDLE> and print HANDLE will simply behave as no-ops when given an invalid handle.
Don't close your files--they'll be closed on the next open.
Don't pass subroutine arguments. Use globals.
Don't name your subroutine parameters. You can access them directly as $_[EXPR].
Use whatever you think of first.

24.2.4. Maintainer Efficiency

Don't use defaults.
Use foreach to mean foreach.
Use meaningful loop labels with next and last.
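A labeled loop might look like the following sketch, which drops any row containing a negative number. The data and the ROW label are invented; the label tells the maintainer at a glance which loop the next applies to:

```perl
use strict;
use warnings;

my @rows = ([1, 2, 3], [4, -1, 6], [7, 8, 9]);    # hypothetical data
my @clean;

ROW: for my $row (@rows) {
    for my $cell (@$row) {
        next ROW if $cell < 0;    # skip the whole row, not just this cell
    }
    push @clean, $row;
}
```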
Use meaningful variable names.
Use meaningful subroutine names.
Put the important thing first on the line using and, or, and statement modifiers (like exit if $done).
Close your files as soon as you're done with them.
Use packages, modules, and classes to hide your implementation details.
Pass arguments as subroutine parameters.
Name your subroutine parameters using my.
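For instance, with a made-up subroutine, compare how much the first line tells the reader against a body written in terms of $_[0] and $_[1]:

```perl
use strict;
use warnings;

sub area {
    my ($width, $height) = @_;    # named copies of the arguments
    return $width * $height;
}

my $area = area(3, 4);
```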
Parenthesize for clarity.
Put in lots of (useful) comments.
Include embedded pod documentation.
use warnings.
use strict.

24.2.5. Porter Efficiency

Wave a handsome tip under his nose.
Avoid functions that aren't implemented everywhere. You can use eval tests to see what's available.
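Such eval tests can be as short as the sketch below. The symlink probe follows the idiom given in the symlink documentation; the choice of Time::HiRes as the module to probe for is just an example:

```perl
use strict;
use warnings;

# symlink raises an exception on platforms that lack it; the eval traps that.
my $has_symlink = eval { symlink("", ""); 1 } ? 1 : 0;

# The same trick tells you whether a module can be loaded.
my $has_hires = eval { require Time::HiRes; 1 } ? 1 : 0;

# Fall back to the built-in time when the module is missing.
my $now = $has_hires ? Time::HiRes::time() : time();
```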
Use the Config module or the $^O variable to find out what kind of machine you're running on.
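A sketch of both sources of platform information. The decision made with them (picking a null device name) is a hypothetical use, not something the Config module does for you:

```perl
use strict;
use warnings;
use Config;

my $os   = $^O;                  # e.g. "linux", "darwin", "MSWin32"
my $arch = $Config{archname};    # architecture this perl was built for

# Hypothetical decision based on the platform.
my $null_device = $os eq 'MSWin32' ? 'NUL' : '/dev/null';
```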
Don't expect native float and double to pack and unpack on foreign machines.
Use network byte order (the "n" and "N" formats for pack) when sending binary data over the network.
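A minimal sketch of the "n" and "N" formats, using an invented port number and record ID. The byte layout of the result is the same no matter which machine does the packing:

```perl
use strict;
use warnings;

# Hypothetical values: a 16-bit port and a 32-bit record ID.
my $wire = pack("n N", 8080, 0x12345678);    # big-endian on every platform

# The receiving side unpacks with the same template.
my ($port, $id) = unpack("n N", $wire);
```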
Don't send binary data over the network. Send ASCII. Better, send UTF-8. Better yet, send money.
Check $] or $^V to see if the current version supports all the features you use.
Don't use $] or $^V. Use require or use with a version number.
Put in the eval exec hack even if you don't use it, so your program will run on those few systems that have Unix-like shells but don't recognize the #! notation.
Put the #!/usr/bin/perl line in even if you don't use it.
Test for variants of Unix commands. Some find programs can't handle the -xdev switch, for example.
Avoid variant Unix commands if you can do it internally. Unix commands don't work too well on MS-DOS or VMS.
Put all your scripts and manpages into a single network filesystem that's mounted on all your machines.
Publish your module on CPAN. You'll get lots of feedback if it's not portable.

24.2.6. User Efficiency

Instead of making users enter data line by line, pop users into their favorite editor.
Better yet, use a GUI like the Perl/Tk extension, where users can control the order of events. (Perl/Tk is available on CPAN.)
Put up something for users to read while you continue doing work.
Use autoloading so that the program appears to run faster.
Give the option of helpful messages at every prompt.
Give a helpful usage message if users don't give correct input.
Display the default action at every prompt, and maybe a few alternatives.
Choose defaults for beginners. Allow experts to change the defaults.
Use single character input where it makes sense.
Pattern the interaction after other things the user is familiar with.
Make error messages clear about what needs fixing. Include all pertinent information such as filename and error code, like this:
```
open(FILE, $file) or die "$0: Can't open $file for reading: $!\n";
```
Use fork && exit to detach from the terminal when the rest of the script is just batch processing.
Allow arguments to come from either the command line or standard input.
Don't put arbitrary limitations into your program.
Prefer variable-length fields over fixed-length fields.
Use text-oriented network protocols.
Tell everyone else to use text-oriented network protocols!
Tell everyone else to tell everyone else to use text-oriented network protocols!!!
Be vicariously lazy.
Be nice.