Subject: bin/20663: deadlock in cron(8)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <p@ppires.org>
List: netbsd-bugs
Date: 03/11/2003 16:19:05
>Number: 20663
>Category: bin
>Synopsis: temporary system errors can cause cron jobs to hang forever
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Mar 11 11:20:00 PST 2003
>Closed-Date:
>Last-Modified:
>Originator: Paulo A. P. Pires
>Release: NetBSD 1.6P (2003/03/09)
>Organization:
>Environment:
System: NetBSD mateus.ppires.org 1.6P NetBSD 1.6P (MATEUS-20030309) #0: Sun Mar 9 21:28:08 BRT 2003 Pappires@mateus.ppires.org:/usr/src/sys/arch/i386/compile/MATEUS-20030309 i386
Architecture: i386
Machine: i386
>Description:
In certain situations, especially error or abnormal system
conditions, cron child processes freeze and never recover.
One situation where the problem happens frequently is when a NIS
(and/or NFS) server becomes unavailable to a client on which the
cron daemon runs.
For each cron job, the cron daemon forks a child, which forks (with
vfork()) a grandchild, which runs the command for the job. If
the NIS server becomes unavailable, one of these processes hangs,
and so does the other, since there is a pipe between the child and
the grandchild.
Even after NIS operation resumes, neither process continues.
ktrace(1) on either process shows no activity. ps(1) shows something
like the following:
# ps -ax | grep cron
252 ?? Ss 0:00.63 /usr/sbin/cron
808 ?? SW 0:00.00 /USR/SBIN/CRON (cron)
809 ?? IWVs 0:00.00 /USR/SBIN/CRON (cron)
(... dozens of other pairs of CRONs ...)
(Process 809 was in "pipewr" state -- unfortunately, I didn't
record what state pid 808 was in.)
Before killing process 809, I tried to send STOP and CONT signals
to wake it up, but it didn't work. However, SIGTERM was enough
to kill it (no "kill -9" was necessary). When it was killed,
process 808 resumed, reading what process 809 had output before
hanging (260 lines saying "clntudp_create: RPC: Port mapper
failure - RPC: Unable to send"), and sending this output back to
the crontab owner with sendmail.
From these observations, it seems that the problem is that the
grandchild, which is created with vfork(), completely fills the
buffer of its output pipe with messages intended to go to stderr,
before having the opportunity to call execve() or _exit(). Since
vfork() keeps the parent suspended until the child calls one of
those, and the grandchild never reaches them, its "parent" is
never unblocked to consume piped data, thus causing a deadlock.
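The first half of that deadlock, a writer filling a pipe that nobody reads, can be illustrated in isolation. The sketch below is not cron code; pipe_capacity() is a hypothetical helper that counts how many bytes a pipe accepts before a write would block. It uses O_NONBLOCK so that the call returns instead of sleeping the way the grandchild does.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of cron): returns how many bytes a pipe
 * accepts while nobody reads from it, or -1 on setup error.
 */
long pipe_capacity(void)
{
    int fds[2];
    int flags;
    char chunk[512];
    long total = 0;

    if (pipe(fds) == -1)
        return -1;

    /* Non-blocking mode makes write() fail with EAGAIN instead of sleeping. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        /* EAGAIN (or another error): a blocking writer would sleep here. */
        break;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

The value returned is system dependent. Any amount of pre-exec output larger than it would put a blocking writer to sleep, which matches the "pipewr" state seen for pid 809.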
This problem is serious because it reveals a design fault that
prevents the program from recovering from a relatively common
_temporary_ error situation, but it becomes critical on busier
machines, where more crontabs or crontab jobs could lead to lots
of hung processes, wasting both memory and process slots, possibly
leading to resource exhaustion or intentional denial of service.
>How-To-Repeat:
Break access to a NIS server from a client running cron with
active crontabs and watch the pile of processes grow.
>Fix:
I believe that changing vfork() to ordinary fork() is a good
starting point. It is likely, however, that more changes are
needed, since it seems, from a quick glance, that the current
code relies on pure vfork() semantics.
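As a rough illustration of why fork() avoids the hang, the sketch below has a child write more data than a pipe is likely to hold before it would exec, while the parent drains the pipe concurrently. This is an assumption-laden stand-in, not the actual do_command.c code: capture_with_fork() and its nbytes parameter are hypothetical names, and the child simply calls _exit() where a real job would call execve().

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for cron's child/grandchild pair. Returns the
 * number of bytes the parent received, or -1 on setup error.
 */
long capture_with_fork(size_t nbytes)
{
    int fds[2];
    pid_t pid;
    long received = 0;
    char buf[4096];

    if (pipe(fds) == -1)
        return -1;

    pid = fork();               /* fork(), not vfork(): parent keeps running */
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Grandchild: emit diagnostics before exec, as in the report. */
        char msg[256];
        size_t left = nbytes;

        close(fds[0]);
        memset(msg, 'e', sizeof(msg));
        while (left > 0) {
            size_t want = left < sizeof(msg) ? left : sizeof(msg);
            ssize_t w = write(fds[1], msg, want);
            if (w == -1) {
                if (errno == EINTR)
                    continue;
                _exit(1);
            }
            left -= (size_t)w;
        }
        _exit(0);               /* a real job would call execve() here */
    }

    /* Parent: close the write end so read() sees EOF when the child exits. */
    close(fds[1]);
    for (;;) {
        ssize_t n = read(fds[0], buf, sizeof(buf));
        if (n > 0) {
            received += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        break;                  /* EOF or a read error */
    }
    close(fds[0]);

    while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
        ;
    return received;
}
```

With vfork() in place of fork(), the parent would not reach the read loop until the child had already blocked in write(), which is the hang described above.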
I looked in the FreeBSD CVS repository, and this bug is there, too,
but I am not sure whether FreeBSD honors the original vfork()
semantics. OpenBSD banned vfork() from do_command.c, so it may be
useful to look at their change closely.
Unfortunately, I have no way to test this soon, as I am moving
and my development NetBSD machine will stay off until I can set
up a place for it in my new home, and I would rather not experiment
on my production machines.
>Release-Note:
>Audit-Trail:
>Unformatted: