Subject: Re: sparc64 / 2.0.1 and thread crashes (was: Re: Ultra 5 / 2.0 / panic: lockmgr: no context)
To: Michael <macallan18@earthlink.net>
From: Gert Doering <gert@greenie.muc.de>
List: port-sparc64
Date: 02/08/2005 09:37:36
Hi,
to summarize and close down this thread...
- my U5 with 2.0.1 kept crashing every night in "low-duty" periods,
with varying kernel error messages - sometimes even with RED STATE
and SIR Reset messages that pointed to hardware issues
- I've swapped nearly all relevant parts, and it didn't change the
problem
- reducing the amount of RAM from 512 MB to 256 MB made the problem
appear much faster (crash after about 2-3 hours), which made me assume
"it has something to do with threads and swapping" (the "low-duty"
period where it usually crashed is related to Amanda backup - which
needs LOTS of RAM, so it's likely that other processes got swapped
out, and the next time they're needed -> *boom*)
- tried running without swap for one night: no crash, but lots of
processes died due to an out-of-memory situation (Amanda's fault) - so
this seemed to confirm that it's not a hardware problem, but
"threads+swapping" indeed.
- only two things on the system use threads: perl, and clamav-milter
- rebuilt perl (5.8.x) and all perl modules to use non-threaded perl
(clamav-milter still using native threads)
-> didn't help, machine still crashed -> so it wasn't perl/spamd
- rebuilt kernel with a backport of the -current thread changes
(the L_SA_SWITCHING stuff), rebuilt libpthread.so with
PTHREAD_MLOCK_KLUDGE. Rebooted this kernel, waited.
Machine did not crash "in the usual way", but it ended up being
unusable in other ways:
* "top" displayed "[ioflush]" taking 100% CPU usage (indefinitely)
* typing "sync" made "sync" appear in top, sharing 100% CPU usage
with "[ioflush]" (both using 50%, obviously)
* trying to umount a not-in-use filesystem (to see whether it would
trigger anything) led to "umount" hanging, consuming CPU
* assuming that "clamav-milter" might be the culprit, I tried to
kill it. Various signals were ignored, "kill -9" led to a kernel
fault:
data fault: pc=11a0434 addr=0
kernel trap 30: data access exception
Stopped in pid 27070.1 (kill) at netbsd:lwp_continue+0x20: ld [%l0 + 0x44], %g1
db>
-> so I have to assume that the current thread fixes *do* fix "sparc"
(as has been reported by others) but not yet "sparc64".
- as a last measure, I've rebuilt libmilter.a and clamav-milter to
use GNU pth (from pkgsrc) and am now running that combo, and *no*
processes that use native pthreads anymore.
Since then, the machine has NOT crashed a single time.
kirk$ uptime
9:33AM up 2 days, 13:58, 10 users, load averages: 0.98, 0.63, 0.56
(which is not something one would usually be proud of, but since the
machine has crashed every single night for the last 4 weeks, it seems
to be the breakthrough)
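For anyone who wants to repeat the "which programs still use native
threads" check from the list above, here is a minimal sketch. It assumes
that "ldd" is available and that a dynamically linked binary using native
pthreads shows a libpthread entry in its output. The function names are
made up for illustration, and this is only a heuristic: statically linked
binaries will not be detected this way.

```python
import subprocess
import sys


def links_native_pthreads(ldd_output: str) -> bool:
    """Return True if any line of `ldd` output mentions libpthread."""
    return any("libpthread" in line for line in ldd_output.splitlines())


def check_binary(path: str) -> bool:
    """Run `ldd` on `path` and report whether it pulls in libpthread."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return links_native_pthreads(result.stdout)


if __name__ == "__main__":
    # Binaries to inspect are passed on the command line; no paths are assumed.
    for binary in sys.argv[1:]:
        verdict = "native pthreads" if check_binary(binary) else "no libpthread"
        print(f"{binary}: {verdict}")
```

Run it with the paths of the suspect binaries as arguments; anything
reported as "native pthreads" is a candidate for rebuilding against a
different thread library.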
I hope this summary is useful for someone out there :-)
If there are specific additional sparc64/-current patches that I should
test, just tell me. I have a different machine available (U10) that is
used as a workstation, and fairly reliably crashes when running Mozilla
(native pthreads) while building a NetBSD world, or doing a "CVS update"
on the NetBSD src tree.
gert
--
USENET is *not* the non-clickable part of WWW!
//www.muc.de/~gert/
Gert Doering - Munich, Germany gert@greenie.muc.de
fax: +49-89-35655025 gert@net.informatik.tu-muenchen.de