CHECK and INIT Blocks (Practical mod

6.5. CHECK and INIT Blocks

The CHECK and INIT blocks run when compilation is complete, but before the program starts. CHECK can mean "checkpoint," "double-check," or even just "stop." INIT stands for "initialization." The difference is subtle: CHECK blocks are run just after the compilation ends, whereas INIT blocks are run just before the runtime begins (hence, the -c command-line flag to Perl runs up to CHECK blocks but not INIT blocks).

Perl calls these blocks only during perl_parse( ), which mod_perl calls once at startup time. Therefore, CHECK and INIT blocks don't work in mod_perl, for the same reason these don't:

panic% perl -e 'eval qq(CHECK { print "ok\n" })'
panic% perl -e 'eval qq(INIT  { print "ok\n" })'

6.5.1. $^T and time( )

Under mod_perl, processes don't quit after serving a single request. Thus, $^T gets initialized to the server startup time and retains this value throughout the process's life. Even if you don't use this variable directly, it's important to know that Perl refers to the value of $^T internally.

For example, Perl uses $^T with the -M, -C, or -A file test operators. As a result, files created after the child server's startup are reported as having a negative age when using those operators. -M returns the age of the script file relative to the value of the $^Tspecial variable.

If you want to have -M report the file's age relative to the current request, reset $^T, just as in any other Perl script. Add the following line at the beginning of your scripts:

local $^T = time;

You can also do:

local $^T = $r->request_time;

The second technique is better performance-wise, as it skips the time( ) system call and uses the timestamp of the request's start time, available via the $r->request_time method.

If this correction needs to be applied to a lot of handlers, a more scalable solution is to specify a fixup handler, which will be executed during the fixup stage:

sub Apache::PerlBaseTime::handler {
    $^T = shift->request_time;
    return Apache::Constants::DECLINED;
}

and then add the following line to httpd.conf:

PerlFixupHandler Apache::PerlBaseTime

Now no modifications to the content-handler code and scripts need to be performed.

6.5.2. Command-Line Switches

When a Perl script is run from the command line, the shell invokes the Perl interpreter via the #!/bin/perl directive, which is the first line of the script (sometimes referred to as the shebang line). In scripts running under mod_cgi, you may use Perl switches as described in the perlrun manpage, such as -w, -T, or -d. Under the Apache::Registry handlers family, all switches except -w are ignored (and use of the -T switch triggers a warning). The support for -w was added for backward compatibility with mod_cgi.

Most command-line switches have special Perl variable equivalents that allow them to be set/unset in code. Consult the perlvar manpage for more details.

mod_perl provides its own equivalents to -w and -T in the form of configuration directives, as we'll discuss presently.

Finally, if you still need to set additional Perl startup flags, such as -d and -D, you can use the PERL5OPT environment variable. Switches in this variable are treated as if they were on every Perl command line. According to the perlrun manpage, only the -[DIMUdmw] switches are allowed.

6.5.2.1. Warnings

There are three ways to enable warnings:

Globally to all processes

In httpd.conf, set:

PerlWarn On

You can then fine-tune your code, turning warnings off and on by setting the $^W variable in your scripts.

Locally to a script

Including the following line:

#!/usr/bin/perl -w

will turn warnings on for the scope of the script. You can turn them off and on in the script by setting the $^W variable, as noted above.

Locally to a block

This code turns warnings on for the scope of the block:

{
    local $^W = 1;
    # some code
}
# $^W assumes its previous value here

This turns warnings off:

{
    local $^W = 0;
    # some code
}
# $^W assumes its previous value here

If $^W isn't properly localized, this code will affect the current request and all subsequent requests processed by this child. Thus:

$^W = 0;

will turn the warnings off, no matter what.

If you want to turn warnings on for the scope of the whole file, as in the previous item, you can do this by adding:

local $^W = 1;

at the beginning of the file. Since a file is effectively a block, file scope behaves like a block's curly braces ({ }), and local $^W at the start of the file will be effective for the whole file.

While having warnings mode turned on is essential for a development server, you should turn it globally off on a production server. Having warnings enabled introduces a non-negligible performance penalty. Also, if every request served generates one warning, and your server processes millions of requests per day, the error_log file will eat up all your disk space and the system won't be able to function normally anymore.

Perl 5.6.x introduced the warnings pragma, which allows very flexible control over warnings. This pragma allows you to enable and disable groups of warnings. For example, to enable only the syntax warnings, you can use:

use warnings 'syntax';

Later in the code, if you want to disable syntax warnings and enable signal-related warnings, you can use:

no  warnings 'syntax';
use warnings 'signal';

But usually you just want to use:

use warnings;

which is the equivalent of:

use warnings 'all';

If you want your code to be really clean and consider all warnings as errors, Perl will help you to do that. With the following code, any warning in the lexical scope of the definition will trigger a fatal error:

use warnings FATAL => 'all';

Of course, you can fine-tune the groups of warnings and make only certain groups of warnings fatal. For example, to make only closure problems fatal, you can use:

use warnings FATAL => 'closure';

Using the warnings pragma, you can also disable warnings locally:

{
  no warnings;
  # some code that would normally emit warnings
}

In this way, you can avoid some warnings that you are aware of but can't do anything about.

For more information about the warnings pragma, refer to the perllexwarn manpage.

6.5.2.2. Taint mode

Perl's -T switch enables taint mode. In taint mode, Perl performs some checks on how your program is using the data passed to it. For example, taint checks prevent your program from passing some external data to a system call without this data being explicitly checked for nastiness, thus avoiding a fairly large number of common security holes. If you don't force all your scripts and handlers to run under taint mode, it's more likely that you'll leave some holes to be exploited by malicious users. (See Chapter 23 and the perlsec manpage for more information. Also read the re pragma's manpage.)

Since the -Tswitch can't be turned on from within Perl (this is because when Perl is running, it's already too late to mark all external data as tainted), mod_perl provides the PerlTaintCheck directive to turn on taint checks globally. Enable this mode with:

PerlTaintCheck On

anywhere in httpd.conf (though it's better to place it as early as possible for clarity).

For more information on taint checks and how to untaint data, refer to the perlsec manpage.

6.5.3. Compiled Regular Expressions

When using a regular expression containing an interpolated Perl variable that you are confident will not change during the execution of the program, a standard speed-optimization technique is to add the /o modifier to the regex pattern. This compiles the regular expression once, for the entire lifetime of the script, rather than every time the pattern is executed. Consider:

my $pattern = '^\d+$'; # likely to be input from an HTML form field
foreach (@list) {
    print if /$pattern/o;
}

This is usually a big win in loops over lists, or when using the grep( ) or map( ) operators.

In long-lived mod_perl scripts and handlers, however, the variable may change with each invocation. In that case, this memorization can pose a problem. The first request processed by a fresh mod_perl child process will compile the regex and perform the search correctly. However, all subsequent requests running the same code in the same process will use the memorized pattern and not the fresh one supplied by users. The code will appear to be broken.

Imagine that you run a search engine service, and one person enters a search keyword of her choice and finds what she's looking for. Then another person who happens to be served by the same process searches for a different keyword, but unexpectedly receives the same search results as the previous person.

There are two solutions to this problem.

The first solution is to use the eval q// construct to force the code to be evaluated each time it's run. It's important that the eval block covers the entire processing loop, not just the pattern match itself.

The original code fragment would be rewritten as:

my $pattern = '^\d+$';
eval q{
    foreach (@list) {
        print if /$pattern/o;
    }
}

If we were to write this:

foreach (@list) {
    eval q{ print if /$pattern/o; };
}

the regex would be compiled for every element in the list, instead of just once for the entire loop over the list (and the /o modifier would essentially be useless).

However, watch out for using strings coming from an untrusted origin inside eval—they might contain Perl code dangerous to your system, so make sure to sanity-check them first.

This approach can be used if there is more than one pattern-match operator in a given section of code. If the section contains only one regex operator (be it m// or s///), you can rely on the property of the null pattern, which reuses the last pattern seen. This leads to the second solution, which also eliminates the use of eval.

The above code fragment becomes:

my $pattern = '^\d+$';
"0" =~ /$pattern/; # dummy match that must not fail!
foreach (@list) {
    print if //;
}

The only caveat is that the dummy match that boots the regular expression engine mustsucceed—otherwise the pattern will not be cached, and the // will match everything. If you can't count on fixed text to ensure the match succeeds, you have two options.

If you can guarantee that the pattern variable contains no metacharacters (such as *, +, ^, $, \d, etc.), you can use the dummy match of the pattern itself:

$pattern =~ /\Q$pattern\E/; # guaranteed if no metacharacters present

The \Q modifier ensures that any special regex characters will be escaped.

If there is a possibility that the pattern contains metacharacters, you should match the pattern itself, or the nonsearchable \377 character, as follows:

"\377" =~ /$pattern|^\377$/; # guaranteed if metacharacters present

6.5.3.1. Matching patterns repeatedly

Another technique may also be used, depending on the complexity of the regex to which it is applied. One common situation in which a compiled regex is usually more efficient is when you are matching any one of a group of patterns over and over again.

To make this approach easier to use, we'll use a slightly modified helper routine from Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly):

sub build_match_many_function {
    my @list = @_;
    my $expr = join '||', 
        map { "\$_[0] =~ m/\$list[$_]/o" } (0..$#list);
    my $matchsub = eval "sub { $expr }";
    die "Failed in building regex @list: $@" if $@;
    return $matchsub;
}

This function accepts a list of patterns as an argument, builds a match regex for each item in the list against $_[0], and uses the logical || (OR) operator to stop the matching when the first match succeeds. The chain of pattern matches is then placed into a string and compiled within an anonymous subroutine using eval. If eval fails, the code aborts with die( ); otherwise, a reference to this subroutine is returned to the caller.

Here is how it can be used:

my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
my $known_agent_sub = build_match_many_function(@agents);

while (<ACCESS_LOG>) {
    my $agent = get_agent_field($_);
    warn "Unknown Agent: $agent\n"
        unless $known_agent_sub->($agent);
}

This code takes lines of log entries from the access_log file already opened on the ACCESS_LOG file handle, extracts the agent field from each entry in the log file, and tries to match it against the list of known agents. Every time the match fails, it prints a warning with the name of the unknown agent.

An alternative approach is to use the qr// operator, which is used to compile a regex. The previous example can be rewritten as:

my @agents = qw(Mozilla Lynx MSIE AmigaVoyager lwp libwww);
my @compiled_re = map qr/$_/, @agents;

while (<ACCESS_LOG>) {
    my $agent = get_agent_field($_);
    my $ok = 0;
    for my $re (@compiled_re) {
        $ok = 1, last if /$re/;
    }
    warn "Unknown Agent: $agent\n"
        unless $ok;
}

In this code, we compile the patterns once before we use them, similar to build_match_many_function( ) from the previous example, but now we save an extra call to a subroutine. A simple benchmark shows that this example is about 2.5 times faster than the previous one.