netbsd-bugs: bin/20663: deadlock in cron(8)

Subject: bin/20663: deadlock in cron(8)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <p@ppires.org>
List: netbsd-bugs
Date: 03/11/2003 16:19:05
>Number:         20663
>Category:       bin
>Synopsis:       temporary system errors can cause cron jobs to hang forever
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Mar 11 11:20:00 PST 2003
>Closed-Date:
>Last-Modified:
>Originator:     Paulo A. P. Pires
>Release:        NetBSD 1.6P (2003/03/09)
>Organization:
	
>Environment:
	
	
System: NetBSD mateus.ppires.org 1.6P NetBSD 1.6P (MATEUS-20030309) #0: Sun Mar 9 21:28:08 BRT 2003 Pappires@mateus.ppires.org:/usr/src/sys/arch/i386/compile/MATEUS-20030309 i386
Architecture: i386
Machine: i386

>Description:
	On certain occasions, especially under error or abnormal system
	conditions, cron child processes hang and never resume.
	One situation where the problem happens frequently is
	when a NIS (and/or NFS) server becomes unavailable to a client
	on which the cron daemon runs.

	For each cron job, the cron daemon forks a child, which in turn
	creates (with vfork()) a grandchild that runs the command for the
	job.  If the NIS server becomes unavailable, one of these processes
	hangs, and so does the other, since there is a pipe between the
	child and the grandchild.

	Even after NIS operation resumes, neither process continues.
	ktrace(1) on either process shows no activity.  ps(1) shows
	something like the following.

		# ps -ax | grep cron
		252 ?? Ss    0:00.63 /usr/sbin/cron
		808 ?? SW    0:00.00 /USR/SBIN/CRON (cron)
		809 ?? IWVs  0:00.00 /USR/SBIN/CRON (cron)
		    (... dozens of other pairs of CRONs ...)

	(Process 809 was in "pipewr" state -- unfortunately, I didn't
	record what state pid 808 was in.)

	Before killing process 809, I tried sending STOP and CONT signals
	to wake it up, but that did not work.  However, SIGTERM was enough
	to kill it (no "kill -9" was necessary).  When it was killed,
	process 808 resumed, read what process 809 had output before
	hanging (260 lines saying "clntudp_create: RPC: Port mapper
	failure - RPC: Unable to send"), and sent this output back to
	the crontab owner with sendmail.

	From these observations, it seems that the problem is that the
	grandchild, which is created with vfork(), completely fills
	the buffer of its output pipe with messages intended to go to
	stderr before it has the opportunity to call execve() or _exit().
	As the grandchild never reaches those syscalls, its "parent" is
	never unblocked to consume the piped data, thus causing a deadlock.
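
	The blocking behaviour described above can be reproduced in
	isolation.  The following is a minimal sketch, not code from cron:
	the helper name pipe_fill_limit() is hypothetical, and it sets
	O_NONBLOCK only so that the program can report the limit instead
	of hanging.  It fills a pipe that nobody reads and returns how many
	bytes were accepted.  A blocking writer, such as the grandchild,
	would sleep at that point until a reader drains the pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of cron): returns the number of bytes a
 * pipe accepts while no reader is draining it, or -1 on error.
 */
long pipe_fill_limit(void)
{
    int fds[2];
    int flags;
    char chunk[512];
    long total = 0;

    if (pipe(fds) == -1)
        return -1;

    /* Non-blocking mode lets us observe "buffer full" as EAGAIN. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        /* EAGAIN means the buffer is full: a blocking write would
         * sleep here.  Any other failure is reported as an error. */
        if (!(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)))
            total = -1;
        break;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

	The value returned depends on the system.  As a rough consistency
	check, assuming each of the 260 lines was exactly the 62-character
	message quoted earlier plus a newline, the output recovered from
	process 809 comes to 260 * 63 = 16380 bytes, which would fit a
	16384-byte pipe buffer with no room for another line.  That buffer
	size is an assumption, not something measured on the affected
	machine.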

	This problem is serious because it reveals a design fault that
	prevents the program from recovering from a relatively common
	_temporary_ error situation.  It becomes critical on busier
	machines, where more crontabs or crontab jobs could lead to lots
	of hung processes, wasting both memory and process slots, possibly
	leading to resource exhaustion or intentional denial of service.

>How-To-Repeat:
	Break access to a NIS server from a client running cron with
	active crontabs and watch the pile of processes grow.

>Fix:
	I believe that changing vfork() to ordinary fork() is a good
	starting point.  It is likely, however, that more changes need
	to be made, since it seems, from a quick glance, that the current
	code relies on pure vfork() semantics.
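
	One possible shape for the fork()-based approach, as a sketch
	under stated assumptions and not a patch against do_command.c:
	the function name collect_child_output() and the loop that writes
	the error line are hypothetical stand-ins for whatever the
	grandchild prints before it would call execve().  Because fork()
	lets the parent keep running, it can drain the pipe while the
	child is still writing, so the child does not stay blocked on a
	full buffer.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical illustration (not cron's code): the child writes `lines`
 * copies of an error message to a pipe and exits; the parent reads until
 * end-of-file.  Returns the number of bytes read, or -1 on error.
 */
long collect_child_output(int lines)
{
    static const char msg[] =
        "clntudp_create: RPC: Port mapper failure - RPC: Unable to send\n";
    int fds[2];
    int status;
    int i;
    pid_t pid;
    long total = 0;
    char buf[4096];
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    pid = fork();               /* fork(), so the parent is not suspended */
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stands in for output produced before execve(). */
        close(fds[0]);
        for (i = 0; i < lines; i++) {
            if (write(fds[1], msg, sizeof(msg) - 1) == -1)
                _exit(1);
        }
        _exit(0);
    }

    /* Parent: close our copy of the write end so read() can see EOF,
     * then drain the pipe while the child is still running. */
    close(fds[1]);
    while ((n = read(fds[0], buf, sizeof(buf))) != 0) {
        if (n == -1) {
            if (errno == EINTR)
                continue;
            total = -1;
            break;
        }
        total += n;
    }
    close(fds[0]);

    /* Reap the child so it does not linger as a zombie. */
    while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
        continue;

    return total;
}
```

	Calling collect_child_output(5000) asks the child for far more
	output than a pipe buffer is likely to hold, yet the call returns
	because the parent keeps reading.  With vfork() the parent would
	not run until the child called execve() or _exit(), which is the
	situation this report describes.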

	I looked in the FreeBSD CVS repository, and this bug is there, too,
	but I am not sure whether FreeBSD honors the original vfork()
	semantics.  OpenBSD banned vfork() from do_command.c, so it may be
	useful to look at their change closely.

	Unfortunately, I have no way to test this soon, as I am moving
	and my development NetBSD machine will stay off until I can set
	up a place for it in my new home, and I prefer to avoid
	experimenting on my production machines.
>Release-Note:
>Audit-Trail:
>Unformatted: