home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


16.19. Avoiding Zombie Processes

Problem

Your program forks children, but the dead children accumulate, fill up your process table, and aggravate your system administrator.

Solution

If you don't need to record the children that have terminated, use:

$SIG{CHLD} = 'IGNORE';

To keep better track of deceased children, install a SIGCHLD handler to call waitpid :

use POSIX ":sys_wait_h";

$SIG{CHLD} = \&REAPER;
sub REAPER {
    my $stiff;
    while (($stiff = waitpid(-1, &WNOHANG)) > 0) {
        # do something with $stiff if you want
    }
    $SIG{CHLD} = \&REAPER;                  # install *after* calling waitpid
}

Discussion

When a process exits, the system keeps it in the process table so the parent can check its status - whether it terminated normally or abnormally. Fetching a child's status (thereby freeing it to drop from the system altogether) is rather grimly called reaping dead children. (This entire recipe is full of ways to harvest your dead children. If this makes you queasy, we understand.) It involves a call to wait or waitpid . Some Perl functions (piped open s, system , and backticks) will automatically reap the children they make, but you must explicitly wait when you use fork to manually start another process.

To avoid accumulating dead children, simply tell the system that you're not interested in them by setting $SIG{CHLD} to "IGNORE" . If you want to know which children die and when, you'll need to use waitpid .

The waitpid function reaps a single process. Its first argument is the process to wait for - use -1 to mean any process - and its second argument is a set of flags. We use the WNOHANG flag to make waitpid immediately return 0 if there are no dead children. A flag value of 0 is supported everywhere, indicating a blocking wait. Call waitpid from a SIGCHLD handler, as we do in the Solution, to reap the children as soon as they die.

The wait function also reaps children, but it does not have a non-blocking option. If you inadvertently call it when there are running child processes but none have exited, your program will pause until there is a dead child.

Because the kernel keeps track of undelivered signals using a bit vector, one bit per signal, if two children die before your process is scheduled, you will get only a single SIGCHLD. You must always loop when you reap in a SIGCHLD handler, and so you can't use wait .

Both wait and waitpid return the process ID that they just reaped and set $? to the wait status of the defunct process. This status is actually two 8-bit values in one 16-bit number. The high byte is the exit value of the process. The low 7 bits represent the number of the signal that killed the process, with the 8th bit indicating whether a core dump occurred. Here's one way to isolate those values:

$exit_value  = $? >> 8;
$signal_num  = $? & 127;
$dumped_core = $? & 128;

The standard POSIX module has macros to interrogate status values: WIFEXITED, WEXITSTATUS, WIFSIGNALLED, and WTERMSIG. Oddly enough, POSIX doesn't have a macro to test whether the process core dumped.

Beware of two things when using SIGCHLD. First, the system doesn't just send a SIGCHLD when a child exits; it also sends one when the child stops. A process can stop for many reasons - waiting to be foregrounded so it can do terminal I/O, being sent a SIGSTOP (it will wait for the SIGCONT before continuing), or being suspended from its terminal. You need to check the status with the WIFEXITED [ 1 ] function from the POSIX module to make sure you're dealing with a process that really died, and isn't just suspended.

[1] Not SPOUSEXITED , even on a PC.

use POSIX qw(:signal_h :errno_h :sys_wait_h);

$SIG{CHLD} = \&REAPER;
sub REAPER {
    my $pid;

    $pid = waitpid(-1, &WNOHANG);

    if ($pid == -1) {
        # no child waiting.  Ignore it.
    } elsif (WIFEXITED($?)) {
        print "Process $pid exited.\n";
    } else {
        print "False alarm on $pid.\n";
    }
    $SIG{CHLD} = \&REAPER;          # in case of unreliable signals
}

The second trap with SIGCHLD is related to Perl, not the operating system. Because system , open , and backticks all fork subprocesses and the operating system sends your process a SIGCHLD whenever any of its subprocesses exit, you could get called for something you weren't expecting. The built-in operations all wait for the child themselves, so sometimes the SIGCHLD will arrive before the close on the filehandle blocks to reap it. If the signal handler gets to it first, the zombie won't be there for the normal close. This makes close return false and set $! to "No child processes" . Then, if the close gets to the dead child first, waitpid will return 0 .

Most systems support a non-blocking waitpid . Use Perl's standard Config.pm module to find out:

use Config;
$has_nonblocking = $Config{d_waitpid} eq "define" ||
                   $Config{d_wait4}   eq "define";

System V defines SIGCLD, which has the same signal number as SIGCHLD but subtly different semantics. Use SIGCHLD to avoid confusion.

See Also

The "Signals" sections in Chapter 6 of Programming Perl and in perlipc (1); the wait and waitpid functions in Chapter 3 of Programming Perl and in perlfunc (1); the documentation for the standard POSIX module, in Chapter 7 of Programming Perl ; your system's sigaction (2), signal (3), and kill (2) manpages (if you have them); Recipe 16.17