Subject: kern/34101: ltsleep during panic hangs system
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <jld@panix.com>
List: netbsd-bugs
Date: 07/28/2006 03:15:00
>Number: 34101
>Category: kern
>Synopsis: ltsleep during panic hangs system
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jul 28 03:15:00 +0000 2006
>Originator: Jed Davis
>Release: NetBSD 3.0
>Organization:
PANIX Public Access Internet and UNIX, NYC
>Environment:
System: NetBSD panix3.panix.com 3.0 NetBSD 3.0 (PANIX-FIVE) #0: Fri Apr 14 21:05:29 EDT 2006 root@juggler.panix.com:/devel/netbsd/3.0/src/sys/arch/i386/compile/PANIX-FIVE i386
Architecture: i386
Machine: i386
>Description:
The top of ltsleep() contains this:
        /*
         * XXXSMP
         * This is probably bogus.  Figure out what the right
         * thing to do here really is.
         * Note that not sleeping if ltsleep is called with curlwp == NULL
         * in the shutdown case is disgusting but partly necessary given
         * how shutdown (barely) works.
         */
        if (cold || (doing_shutdown && (panicstr || (l == NULL)))) {
                /*
                 * After a panic, or during autoconfiguration,
                 * just give interrupts a chance, then just return;
                 * don't run any other procs or panic below,
                 * in case this is the idle process and already asleep.
                 */
The problem is that if the system is panicking and trying to reboot
(which may include an attempt to sync disks), and a kernel thread that
loops calling ltsleep to wait for work (e.g., aiodoned, or i386's MD
apm_thread) gets woken up, every subsequent ltsleep call returns
immediately instead of sleeping.  The thread then spins forever and
the system never succeeds in rebooting.
However, the check appears to be there for a reason, so the correct
solution is probably not to just yank it out and try to sleep normally.
PR port-i386/33353 was opened for the specific instance of this problem
with apm_thread.  In that special case it might be reasonable to have
the affected thread just exit if it's woken during a panic, but that
doesn't seem like the right solution somehow (even if it'd work).
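To make the failure mode concrete, here is a rough userland sketch of
the spin described above.  It is not kernel code: sim_ltsleep(),
sim_worker_spins() and the sim_* variables are hypothetical stand-ins
that I made up for illustration, and the only thing modelled is the
early-return test quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-ins for the kernel globals of the same
 * names; invented for this sketch, not taken from the NetBSD source. */
static bool sim_cold = false;
static bool sim_doing_shutdown = false;
static const char *sim_panicstr = NULL;

/*
 * Models only the early-return test at the top of ltsleep().
 * Returns true if the caller would have slept, false if the call
 * would have returned immediately without sleeping.
 */
static bool
sim_ltsleep(const void *curlwp)
{
	if (sim_cold ||
	    (sim_doing_shutdown && (sim_panicstr || curlwp == NULL)))
		return false;	/* no sleep: straight back to the caller */
	return true;		/* pretend we blocked until a wakeup */
}

/*
 * A worker shaped like a wait-for-work kernel thread.  The real
 * threads loop forever; this one is bounded by max_iter so the
 * sketch terminates.  Returns how many iterations went around
 * without sleeping.
 */
static int
sim_worker_spins(int max_iter)
{
	int dummy_lwp = 0;	/* non-NULL stand-in for curlwp */
	int spins = 0;

	for (int i = 0; i < max_iter; i++) {
		if (!sim_ltsleep(&dummy_lwp))
			spins++;	/* busy iteration, nothing yielded */
	}
	return spins;
}
```

With the variables at their defaults, sim_worker_spins() reports no
busy iterations.  Once sim_doing_shutdown and sim_panicstr are set,
every iteration is a busy one, which is the behaviour that keeps the
reboot from completing.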
>How-To-Repeat:
This happens most of the time when a host at Panix panics; often
enough that we've had to locally modify swwdog(4) to pass RB_NOSYNC and
use it as a workaround.
>Fix:
That's what I'm filing this PR to find out. A somewhat distasteful
workaround is noted above.