Forking and Executing Subprocesses from mod_perl (Practical mod

10.2. Forking and Executing Subprocesses from mod_perl

When you fork Apache, you are forking the entire Apache server, lock, stock and barrel. Not only are you duplicating your Perl code and the Perl interpreter, but you are also duplicating all the core routines and whatever modules you have used in your server—for example, mod_ssl, mod_rewrite, mod_log, mod_proxy, and mod_speling (no, that's not a typo!). This can be a large overhead on some systems, so wherever possible, it's desirable to avoid forking under mod_perl.

Modern operating systems have a light version of fork( ), optimized to do the absolute minimum of memory-page duplication, which adds little overhead when called. This fork relies on the copy-on-write technique. The gist of this technique is as follows: the parent process's memory pages aren't all copied immediately to the child's space on fork( )ing; this is done later, when the child or the parent modifies the data in the shared memory pages.

If you need to call a Perl program from your mod_perl code, it's better to try to convert the program into a module and call it as a function without spawning a special process to do that. Of course, if you cannot do that or the program is not written in Perl, you have to call the program via system( ) or an equivalent function, which spawns a new process. If the program is written in C, you can try to write some Perl glue code with the help of the Inline, XS, or SWIG architectures. Then the program will be executed as a Perl subroutine and avoid a fork( ) call.

Also, by trying to spawn a subprocess, you might be trying to do the wrong thing. If you just want to do some post-processing after sending a response to the browser, look into the PerlCleanupHandler directive. This allows you to do exactly that. If you just need to run some cleanup code, you may want to register this code during the request processing via:

my $r = shift;
$r->register_cleanup(\&do_cleanup);
sub do_cleanup {
    # some clean-up code here
}

But when a lengthy job needs to be done, there is not much choice but to use fork( ). You cannot just run such a job within an Apache process, since firstly it will keep the Apache process busy instead of letting it do the job it was designed for, and secondly, unless it is coded so as to detach from the Apache processes group, if Apache should happen to be stopped the lengthy job might be terminated as well.

In the following sections, we'll discuss how to properly spawn new processes under mod_perl.

10.2.1. Forking a New Process

The typical way to call fork( ) under mod_perl is illustrated in Example 10-13.

Example 10-13. fork1.pl

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent runs this block
}
else {
    # Child runs this block
    # some code comes here
    CORE::exit(0);
}
# possibly more code here usually run by the parent

When using fork( ), you should check its return value, since a return of undef means that the call was unsuccessful and no process was spawned. This can happen, for example, when the system is already running too many processes and cannot spawn new ones.

When the process is successfully forked, the parent receives the PID of the newly spawned child as a returned value of the fork( ) call and the child receives 0. Now the program splits into two. In the above example, the code inside the first block after if will be executed by the parent, and the code inside the first block after else will be executed by the child.

It's important not to forget to explicitly call exit( ) at the end of the child code when forking. If you don't and there is some code outside the if...else block, the child process will execute it as well. But under mod_perl there is another nuance: you must use CORE::exit( ) and not exit( ), which would be automatically overridden by Apache::exit( ) if used in conjunction with Apache::Registry and similar modules. You want the spawned process to quit when its work is done, or it'll just stay alive, using resources and doing nothing.

The parent process usually completes its execution and returns to the pool of free servers to wait for a new assignment. If the execution is to be aborted earlier for some reason, you should use Apache::exit( ) or die( ). In the case of Apache::Registry or Apache::PerlRun handlers, a simple exit( ) will do the right thing.
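The two return values of fork( ) can be observed outside of mod_perl with a small standalone script. This is only a sketch: the fork_and_report( ) helper and the pipe it uses are illustrative assumptions, not part of the examples above. The child sends the value it received from fork( ) back to the parent and then leaves via CORE::exit( ).

```perl
use strict;
use warnings;

# Hypothetical helper: fork once, let the child report what fork() returned
# to it through a pipe, and hand both views back to the caller.
sub fork_and_report {
    pipe(my $reader, my $writer) or die "Cannot create pipe: $!";

    defined(my $kid = fork) or die "Cannot fork: $!\n";
    if ($kid) {
        # Parent: $kid holds the PID of the child
        close $writer;
        my $child_saw = <$reader>;
        close $reader;
        waitpid($kid, 0);            # collect the child's exit status
        chomp $child_saw;
        return ($kid, $child_saw);
    }
    else {
        # Child: fork returned 0 here
        close $reader;
        print {$writer} "$kid\n";
        close $writer;
        CORE::exit(0);               # never fall through into the parent's code
    }
}

my ($parent_saw, $child_saw) = fork_and_report();
print "parent saw $parent_saw, child saw $child_saw\n";
```

Outside of mod_perl, CORE::exit( ) and exit( ) behave the same; the explicit form is kept so that the sketch matches the code you would write in a handler.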

10.2.2. Freeing the Parent Process

In the child code, you must also close all the pipes to the connection socket that were opened by the parent process (i.e., STDIN and STDOUT) and inherited by the child, so the parent will be able to complete the request and free itself for serving other requests. If you need the STDIN and/or STDOUT streams, you should reopen them. You may need to close or reopen the STDERR file handle, too. As inherited from its parent, it's opened to append to the error_log file, so the chances are that you will want to leave it untouched.

Under mod_perl, the spawned process also inherits the file descriptor that's tied to the socket through which all the communications between the server and the client pass. Therefore, you need to free this stream in the forked process. If you don't, the server can't be restarted while the spawned process is still running. If you attempt to restart the server, you will get the following error:

[Mon May 20 23:04:11 2002] [crit] 
(98)Address already in use: make_sock:
  could not bind to address 127.0.0.1 port 8000

Apache::SubProcess comes to help, providing a method called cleanup_for_exec( ) that takes care of closing this file descriptor.

The simplest way to free the parent process is to close the STDIN, STDOUT, and STDERR streams (if you don't need them) and untie the Apache socket. If the mounted partition is to be unmounted at a later time, in addition you may want to change the current directory of the forked process to / so that the forked process won't keep the mounted partition busy.

To summarize all these issues, here is an example of a fork that takes care of freeing the parent process (Example 10-14).

Example 10-14. fork2.pl

use Apache::SubProcess;

my $r = shift; # the request object, needed for cleanup_for_exec( )
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent runs this block
}
else {
    # Child runs this block
    $r->cleanup_for_exec( ); # untie the socket
    chdir '/' or die "Can't chdir to /: $!";
    close STDIN;
    close STDOUT;
    close STDERR;

    # some code goes here

    CORE::exit(0);
}
# possibly more code here usually run by the parent

Of course, the real work should be placed between the code that frees the parent and the call that terminates the child process.

10.2.3. Detaching the Forked Process

Now what happens if the forked process is running and we decide that we need to restart the web server? This forked process will be aborted, because when the parent process dies during the restart, it will kill its child processes as well. In order to avoid this, we need to detach the process from its parent session by opening a new session with the help of the setsid( ) system call (provided by the POSIX module). This is demonstrated in Example 10-15.

Example 10-15. fork3.pl

use POSIX 'setsid';

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent runs this block
}
else {
    # Child runs this block
    setsid or die "Can't start a new session: $!";
    # ...
}

Now the spawned child process has a life of its own, and it doesn't depend on the parent any more.
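One way to see the effect of setsid( ) is to check the child's process group afterwards: a process that starts a new session also becomes the leader of a new process group, so its process group ID equals its own PID. The following standalone sketch assumes a Unix-like system; the session_demo( ) helper and the pipe are illustrative and not part of the original example.

```perl
use strict;
use warnings;
use POSIX 'setsid';

# Hypothetical helper: fork a child that starts its own session and
# reports its PID and process group ID back through a pipe.
sub session_demo {
    pipe(my $reader, my $writer) or die "Cannot create pipe: $!";

    defined(my $kid = fork) or die "Cannot fork: $!\n";
    if ($kid) {
        close $writer;
        chomp(my $line = <$reader>);
        close $reader;
        waitpid($kid, 0);
        return split ' ', $line;     # (child PID, child process group ID)
    }
    else {
        close $reader;
        setsid or die "Can't start a new session: $!";
        print {$writer} "$$ ", getpgrp(), "\n";
        close $writer;
        CORE::exit(0);
    }
}

my ($child_pid, $child_pgrp) = session_demo();
print "child $child_pid leads process group $child_pgrp\n";
```

Note that setsid( ) fails if the caller is already a process group leader, which is one reason it is called in the freshly forked child rather than in the parent.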

10.2.4. Avoiding Zombie Processes

Normally, every process has a parent. Many processes are children of the init process, whose PID is 1. When you fork a process, you must wait( ) or waitpid( ) for it to finish. If you don't wait( ) for it, it becomes a zombie.

A zombie is a process that has terminated but whose exit status has not yet been collected by its parent. When the child quits, it reports the termination to its parent. If the parent never wait( )s to collect the exit status, the child stays in the process table as a ghost process that can be seen as a process but not killed. It will disappear only when you stop the parent process that spawned it.

Generally, the ps(1) utility displays these processes with the <defunct> tag, and you may see the zombies counter increment when using top(1). These zombie processes can take up system resources and are generally undesirable.

The proper way to do a fork, to avoid zombie processes, is shown in Example 10-16.

Example 10-16. fork4.pl

my $r = shift;
$r->send_http_header('text/plain');

defined (my $kid = fork) or die "Cannot fork: $!";
if ($kid) {
    waitpid($kid,0);
    print "Parent has finished\n";
}
else {
    # do something
    CORE::exit(0);
}

In most cases, the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the Apache process that spawns this new child process has to wait for it to finish, you have gained nothing. You can neither wait for its completion (because you don't have the time to) nor continue without waiting, because if you do you will get yet another zombie process. The waitpid( ) call used here is a blocking call: the process is blocked from doing anything else until the call completes.
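The difference between a blocking and a non-blocking waitpid( ) can be sketched in a standalone script. Here the one-second sleep stands in for real work and the exit code 7 is an arbitrary illustrative value. With the WNOHANG flag from the POSIX module, waitpid( ) returns 0 immediately if the child is still running; with a flag of 0 it blocks until the child has terminated.

```perl
use strict;
use warnings;
use POSIX 'WNOHANG';

defined(my $kid = fork) or die "Cannot fork: $!\n";
if (!$kid) {
    sleep 1;                         # stand-in for real work
    CORE::exit(7);                   # arbitrary exit code for the demonstration
}

# Non-blocking check: returns 0 while the child is still running
my $first_check = waitpid($kid, WNOHANG);

# Blocking call: returns the child's PID once it has terminated
my $reaped    = waitpid($kid, 0);
my $exit_code = $? >> 8;             # high byte of $? is the exit code

print "first check: $first_check, reaped: $reaped, exit code: $exit_code\n";
```

A non-blocking check alone does not solve the problem described above, because something still has to call waitpid( ) again after the child has finished.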

The simplest solution is to ignore your dead children. Just add this line before the fork( ) call:

$SIG{CHLD} = 'IGNORE';

When you set the CHLD (SIGCHLD in C) signal handler to 'IGNORE', the operating system reaps terminated children automatically, so they are prevented from becoming zombies. This doesn't work everywhere, but it has been proven to work at least on Linux.

Note that you cannot localize this setting with local( ). If you try, it won't have the desired effect.

The latest version of the code is shown in Example 10-17.

Example 10-17. fork5.pl

my $r = shift;
$r->send_http_header('text/plain');

$SIG{CHLD} = 'IGNORE';

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    print "Parent has finished\n";
}
else {
    # do something time-consuming
    CORE::exit(0);
}

Note that the waitpid( ) call is gone. The $SIG{CHLD} = 'IGNORE'; statement protects us from zombies, as explained above.
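The effect of ignoring SIGCHLD can be checked with a small standalone script. This is a sketch that assumes a platform where the technique works, such as Linux. Because the terminated child is reaped automatically, a later waitpid( ) has nothing left to collect and returns -1.

```perl
use strict;
use warnings;

$SIG{CHLD} = 'IGNORE';               # must be in place before the fork

defined(my $kid = fork) or die "Cannot fork: $!\n";
if (!$kid) {
    CORE::exit(0);                   # the child finishes immediately
}

# With SIGCHLD ignored there is no exit status to collect:
# waitpid() waits until the child is gone and then returns -1.
my $result = waitpid($kid, 0);
print "waitpid returned $result\n";
```

This also means that the child's exit status is lost, so the technique fits only when you don't need to know how the child ended.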

Another solution (more portable, but slightly more expensive) is to use a double fork approach, as shown in Example 10-18.

Example 10-18. fork6.pl

my $r = shift;
$r->send_http_header('text/plain');

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    waitpid($kid,0);
}
else {
    defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
    if ($grandkid) {
        CORE::exit(0);
    }
    else {
        # code here
        # do something long lasting
        CORE::exit(0);
    }
}

Grandkid becomes a child of init—i.e., a child of the process whose PID is 1.
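The re-parenting can be observed with a standalone variation of the double fork. This is only a sketch: the pipe, the polling loop, and the five-second limit are illustrative assumptions. The grandchild waits until its parent PID is no longer that of the intermediate child and then reports the new value. On a traditional system the new parent is PID 1, but some environments hand orphans to a different reaper process, so the sketch prints whatever it finds.

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "Cannot create pipe: $!";

defined(my $kid = fork) or die "Cannot fork: $!\n";
if (!$kid) {
    close $reader;
    my $kid_pid = $$;
    defined(my $grandkid = fork) or die "Kid cannot fork: $!\n";
    CORE::exit(0) if $grandkid;      # the intermediate child exits at once

    # Grandchild: poll (for up to about 5 seconds) until it is re-parented
    for (1 .. 100) {
        last if getppid() != $kid_pid;
        select(undef, undef, undef, 0.05);
    }
    print {$writer} "$kid_pid ", getppid(), "\n";
    close $writer;
    CORE::exit(0);
}

close $writer;
waitpid($kid, 0);                    # returns quickly, so no zombie is left
chomp(my $line = <$reader>);
my ($kid_pid, $new_parent) = split ' ', $line;
print "grandchild's parent changed from $kid_pid to $new_parent\n";
```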

Note that the previous two solutions do not allow you to determine the exit status of the process, but in our example, we don't care about it.

Yet another solution is to use a different SIGCHLD handler:

use POSIX 'WNOHANG';
$SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {  } };

This is useful when you fork( ) more than one process. The handler could call wait( ) as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid( ) in a tight loop with a first argument of -1 and a second argument of WNOHANG. Together these arguments tell waitpid( ) to reap the next child that's available and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid( ) returns a negative number or zero, indicating that no more reapable children remain.
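The reaping loop can be tried out with several children in a standalone script. In this sketch the %status hash, the exit codes 1 to 3, and the polling loop with its five-second limit are illustrative assumptions. The handler records the exit code of every child it reaps, and localizes $! and $? so that the main program's values are not clobbered.

```perl
use strict;
use warnings;
use POSIX 'WNOHANG';

my %status;                          # PID => exit code of each reaped child

$SIG{CHLD} = sub {
    local ($!, $?);                  # don't clobber the main program's values
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $status{$pid} = $? >> 8;
    }
};

my @kids;
for my $code (1 .. 3) {
    defined(my $kid = fork) or die "Cannot fork: $!\n";
    CORE::exit($code) unless $kid;   # each child exits with its own code
    push @kids, $kid;
}

# Wait (for up to about 5 seconds) until the handler has reaped all children
for (1 .. 100) {
    last if scalar(keys %status) == scalar(@kids);
    select(undef, undef, undef, 0.05);
}

print "child $_ exited with $status{$_}\n" for @kids;
```

Unlike the 'IGNORE' approach, this handler keeps the exit status of each child available to the parent.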

While testing and debugging code that uses one of the above examples, you might want to write debug information to the error_log file so that you know what's happening.

Read the perlipc manpage for more information about signal handlers.

10.2.5. A Complete Fork Example

Now let's put all the bits of code together and show a well-written example that solves all the problems discussed so far. We will use an Apache::Registry script for this purpose. Our script is shown in Example 10-19.

Example 10-19. proper_fork1.pl

use strict;
use POSIX 'setsid';
use Apache::SubProcess;

my $r = shift;
$r->send_http_header("text/plain");

$SIG{CHLD} = 'IGNORE';
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    print "Parent $$ has finished, kid's PID: $kid\n";
}
else {
    $r->cleanup_for_exec( ); # untie the socket
    chdir '/'                 or die "Can't chdir to /: $!";
    open STDIN, '/dev/null'   or die "Can't read /dev/null: $!";
    open STDOUT, '>/dev/null' or die "Can't write to /dev/null: $!";
    open STDERR, '>/tmp/log'  or die "Can't write to /tmp/log: $!";
    setsid                    or die "Can't start a new session: $!";

    my $oldfh = select STDERR;
    local $| = 1;
    select $oldfh;
    warn "started\n";

    # do something time-consuming
    sleep 1, warn "$_\n" for 1..20;
    warn "completed\n";

    CORE::exit(0); # terminate the process
}

The script starts with the usual declaration of strict mode, then loads the POSIX and Apache::SubProcess modules and imports the setsid( ) symbol from the POSIX package.

The HTTP header is sent next, with the Content-Type of text/plain. To avoid zombies, the parent process gets ready to ignore the child, and the fork is called.

The if condition evaluates to a true value for the parent process and to a false value for the child process; therefore, the first block is executed by the parent and the second by the child.

The parent process announces its PID and the PID of the spawned process, and finishes its block. If there is any code outside the if statement, it will be executed by the parent as well.

The child process starts its code by disconnecting from the socket, changing its current directory to /, and opening the STDIN and STDOUT streams to /dev/null (this has the effect of closing them both before opening them). In fact, in this example we don't need either of these, so we could just close( ) both. The child process completes its disengagement from the parent process by opening the STDERR stream to /tmp/log, so it can write to that file, and creates a new session with the help of setsid( ). Now the child process has nothing to do with the parent process and can do the actual processing that it has to do. In our example, it outputs a series of warnings, which are logged to /tmp/log:

my $oldfh = select STDERR;
local $| = 1;
select $oldfh;
warn "started\n";
# do something time-consuming
sleep 1, warn "$_\n" for 1..20;
warn "completed\n";

We set $|=1 to unbuffer the STDERR stream, so we can immediately see the debug output generated by the program. We use the keyword local so that buffering in other processes is not affected. In fact, we don't really need to unbuffer output when it is generated by warn( ). You want it if you use print( ) to debug.

Finally, the child process terminates by calling:

CORE::exit(0);

which makes sure that it terminates at the end of the block and won't run some code that it's not supposed to run.

This code example will allow you to verify that indeed the spawned child process has its own life, and that its parent is free as well. Simply issue a request that will run this script, see that the process starts writing warnings to the file /tmp/log, and issue a complete server stop and start. If everything is correct, the server will successfully restart and the long-term process will still be running. You will know that it's still running if the warnings are still being written into /tmp/log. If Apache takes a long time to stop and restart, you may need to raise the number of warnings to make sure that you don't miss the end of the run.

If there are only five warnings to be printed, you should see the following output in the /tmp/log file:

started
1
2
3
4
5
completed

10.2.6. Starting a Long-Running External Program

What happens if we cannot just run Perl code from the spawned process? We may have a compiled utility, such as a program written in C, or a Perl program that cannot easily be converted into a module and thus called as a function. In this case, we have to use system( ), exec( ), qx( ) or `` (backticks) to start it.

When using any of these methods, and when taint mode is enabled, we must also add the following code to untaint the PATH environment variable and delete a few other insecure environment variables. This information can be found in the perlsec manpage.

$ENV{'PATH'} = '/bin:/usr/bin';
delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};

Now all we have to do is reuse the code from the previous section.

First we move the core program into the external.pl file, then we add the shebang line so that the program will be executed by Perl, tell the program to run under taint mode (-T), possibly enable warnings mode (-w), and make it executable. These changes are shown in Example 10-20.

Example 10-20. external.pl

#!/usr/bin/perl -Tw

open STDIN, '/dev/null'   or die "Can't read /dev/null: $!";
open STDOUT, '>/dev/null' or die "Can't write to /dev/null: $!";
open STDERR, '>/tmp/log'  or die "Can't write to /tmp/log: $!";

my $oldfh = select STDERR;
local $| = 1;
select $oldfh;
warn "started\n";
# do something time-consuming
sleep 1, warn "$_\n" for 1..20;
warn "completed\n";

Now we replace the code that we moved into the external program with a call to exec( ) to run it, as shown in Example 10-21.

Example 10-21. proper_fork_exec.pl

use strict;
use POSIX 'setsid';
use Apache::SubProcess;

$ENV{'PATH'} = '/bin:/usr/bin';
delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};

my $r = shift;
$r->send_http_header("text/html");

$SIG{CHLD} = 'IGNORE';

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    print "Parent has finished, kid's PID: $kid\n";
}
else {
    $r->cleanup_for_exec( ); # untie the socket
    chdir '/'                 or die "Can't chdir to /: $!";
    open STDIN, '/dev/null'   or die "Can't read /dev/null: $!";
    open STDOUT, '>/dev/null' or die "Can't write to /dev/null: $!";
    open STDERR, '>&STDOUT'   or die "Can't dup stdout: $!";
    setsid                    or die "Can't start a new session: $!";

    exec "/home/httpd/perl/external.pl" or die "Cannot execute exec: $!";
}

Notice that exec( ) never returns unless it fails to start the process. Therefore you shouldn't put any code after exec( )—it will not be executed in the case of success. Use system( ) or backticks instead if you want to continue doing other things in the process. But then you probably will want to terminate the process after the program has finished, so you will have to write:

# system( ) returns 0 on success, so test for that explicitly
system("/home/httpd/perl/external.pl") == 0
    or die "Cannot execute system: $?";
CORE::exit(0);
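Because system( ) returns the wait status of the program rather than a true value on success, the result needs to be decoded before you can tell what happened. The following standalone sketch shows one way to do it. The run_and_describe( ) helper is a hypothetical name, and $^X (the path of the running perl) stands in for an external program.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command in list form and describe the outcome
sub run_and_describe {
    my @cmd = @_;
    my $rc = system(@cmd);
    return "failed to start: $!"            if $rc == -1;
    return "killed by signal " . ($? & 127) if $? & 127;
    return "exit code " . ($? >> 8);
}

# $^X is the running perl, used here as a stand-in external program
my $ok  = run_and_describe($^X, '-e', 'exit 0');
my $bad = run_and_describe($^X, '-e', 'exit 3');
print "$ok\n$bad\n";
```

The low seven bits of $? hold the number of the signal that ended the program, if any, and the high byte holds its exit code.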

Another important nuance is that we have to close all STD streams in the forked process, even if the called program does that.

If the external program is written in Perl, you can pass complicated data structures to it using one of the methods to serialize and then restore Perl data. The Storable and FreezeThaw modules come in handy. Let's say that we have a program called master.pl (Example 10-22) calling another program called slave.pl (Example 10-23).

Example 10-22. master.pl

# we are within the mod_perl code
use Storable ( );
my @params = (foo => 1, bar => 2);
my $params = Storable::freeze(\@params);
exec "./slave.pl", $params or die "Cannot execute exec: $!";

Example 10-23. slave.pl

#!/usr/bin/perl -w
use Storable ( );
my @params = @ARGV ? @{ Storable::thaw(shift)||[  ] } : ( );
# do something

As you can see, master.pl serializes the @params data structure with Storable::freeze and passes it to slave.pl as a single argument. slave.pl recovers it with Storable::thaw, by shifting the first value of the @ARGV array (if available). The FreezeThaw module does a very similar thing.
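One caveat when passing frozen data on the command line: Storable produces binary output, and command-line arguments are passed as NUL-terminated strings, so a frozen structure that happens to contain a NUL byte cannot be passed intact. A possible workaround, sketched here with the hypothetical helpers encode_arg( ) and decode_arg( ), is to hex-encode the frozen string with unpack and restore it with pack on the receiving side.

```perl
use strict;
use warnings;
use Storable ();

# Hypothetical helpers: turn a data structure into a printable
# command-line argument and back again.
sub encode_arg { return unpack 'H*', Storable::freeze($_[0]) }
sub decode_arg { return Storable::thaw(pack 'H*', $_[0]) }

my @params  = (foo => 1, bar => 2);
my $arg     = encode_arg(\@params);  # contains only the characters 0-9 and a-f
my @decoded = @{ decode_arg($arg) };

print "argument: $arg\n";
print "decoded:  @decoded\n";
```

In the master.pl and slave.pl setting, the caller would pass the encoded string to exec( ) and the called program would apply the decoding step to the first element of @ARGV. The encoding doubles the length of the argument, so it suits small structures only.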

10.2.7. Starting a Short-Running External Program

Sometimes you need to call an external program and you cannot continue before this program completes its run (e.g., if you need it to return some result). In this case, the fork solution doesn't help. There are a few ways to execute such a program. First, you could use system( ):

system "perl -e 'print 5+5'"

You would never call the Perl interpreter for doing a simple calculation like this, but for the sake of a simple example it's good enough.

The problem with this approach is that we cannot get the results printed to STDOUT. That's where backticks or qx( ) can help. If you use either:

my $result = `perl -e 'print 5+5'`;

or:

my $result = qx{perl -e 'print 5+5'};

the whole output of the external program will be stored in the $result variable.

Of course, you can use other solutions, such as opening a pipe (|) to the program if you need to submit many arguments. And there are more evolved solutions provided by other Perl modules, such as IPC::Open2 and IPC::Open3, that allow you to open a process for reading, writing, and error handling.
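As a sketch of the pipe approach, the list form of open( ) with the '-|' mode runs a program without involving a shell and connects its STDOUT to a filehandle. In this standalone example $^X (the path of the running perl) stands in for the external program, and the same calculation as above is used.

```perl
use strict;
use warnings;

# List-form piped open: runs the program without a shell and
# connects its STDOUT to the $out filehandle.
# $^X (the running perl) stands in for the external program.
open(my $out, '-|', $^X, '-e', 'print 5+5')
    or die "Cannot start the program: $!";
my $result = do { local $/; <$out> };   # slurp everything it prints
close $out
    or die "Program failed, status: $?";

print "result: $result\n";
```

Closing a piped filehandle waits for the program to finish and sets $?, so the check on close( ) also catches a non-zero exit status.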

10.2.8. Executing system( ) or exec( ) in the Right Way

The Perl exec( ) and system( ) functions behave identically in the way they spawn a program. Let's use system( ) as an example. Consider the following code:

system("echo", "Hi");

Perl will use the first argument as a program to execute, find the echo executable along the search path, invoke it directly, and pass the string "Hi" as an argument.

Note that Perl's system( ) is not the same as the standard libc system(3) call.

If there is more than one argument to system( ) or exec( ), or the argument is an array with more than one element in it, the arguments are passed directly to the C-level functions. When the argument is a single scalar or an array with only a single scalar in it, it will first be checked to see if it contains any shell metacharacters (e.g., *, ?). If there are any, the Perl interpreter invokes a real shell program (/bin/sh -c on Unix platforms). If there are no shell metacharacters in the argument, it is split into words and passed directly to the C level, which is more efficient.

In other words, only if you do:

system "echo *"

will Perl actually exec( ) a copy of /bin/sh to parse your command, which may incur a slight overhead on certain OSes.

It's especially important to remember to run your code with taint mode enabled when system( ) or exec( ) is called using a single argument. There can be bad consequences if user input gets to the shell without proper laundering first. Taint mode will alert you when such a condition happens.

Perl will try to do the most efficient thing no matter how the arguments are passed, and the additional overhead may be incurred only if you need the shell to expand some metacharacters before doing the actual call.
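The list form can be checked with a standalone script in which the called program records the arguments it actually receives. This is a sketch under a few assumptions: $^X (the running perl) stands in for the external program, a temporary file carries the result back, and the arguments * and $HOME are arbitrary strings that a shell would normally expand.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# The child program writes its remaining arguments to the file named first.
# $^X (the running perl) stands in for the external program.
my $child_code = 'open my $f, ">", shift or die $!; print $f "@ARGV"';

# List form: no shell is involved, so "*" and "$HOME" arrive untouched
system($^X, '-e', $child_code, $file, '*', '$HOME') == 0
    or die "system failed: $?";

open my $in, '<', $file or die "Cannot read $file: $!";
my $received = do { local $/; <$in> };
close $in;

print "child received: $received\n";
```

Had the same command been passed as a single string, the shell would have expanded both arguments before the program saw them, which is exactly the situation in which unlaundered user input becomes dangerous.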