Subject: The old threaded app paging and dying problem
To: None <port-sparc@netbsd.org, port-sparc64@netbsd.org>
From: Geoff Adams <gadams@avernus.com>
List: port-sparc64
Date: 08/07/2006 02:05:58
I'm still running into this problem, even in -current (3.99.24). That
is, any program that uses pthreads will die, sooner or later. As I
understand it, this happens when some or all of the threaded program
is paged in.

This makes it increasingly hard to use the otherwise ideally suited
NetBSD/sparc{,64} as a server platform for some significant
applications, such as email, where milters are highly desirable and
inherently threaded, or web serving, where I want to run Ruby code on
the back end. Fortunately, bind9 can be compiled without thread support.

What can we do about this issue? I assume it's still there because
it's hard to reproduce in a test harness to find out just what's
wrong. I couldn't find any recent traffic on the mailing lists about
this issue. Has anybody looked into this problem recently? Are there
any clues about where to look? It seems to affect only my sparc and
sparc64 hosts, and not my alphas, so my first guess is that it's
either in MD code or it's something like an alignment problem in MI
code that doesn't cause problems on many ports.

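For anyone who wants to try building a test harness, here is a minimal sketch of a pthread stress routine. It is only a starting point and is not known to trigger the crash: it assumes that several threads repeatedly touching private, page-sized chunks of memory is a reasonable stand-in for what the failing daemons do. The names stress_threads, worker, and struct worker_arg are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-thread state; none of these names come from NetBSD. */
struct worker_arg {
    unsigned char *buf;   /* private buffer owned by one thread */
    size_t pages;         /* number of pages in buf */
    size_t page_size;     /* from sysconf(_SC_PAGESIZE) */
    long rounds;          /* passes over the buffer */
    long touched;         /* written only by the owning thread */
};

/* Touch one byte in every page of the buffer, 'rounds' times over. */
static void *worker(void *p)
{
    struct worker_arg *w = p;

    assert(w != NULL && w->buf != NULL);
    for (long r = 0; r < w->rounds; r++) {
        for (size_t i = 0; i < w->pages; i++) {
            w->buf[i * w->page_size]++;   /* forces the page to be resident */
            w->touched++;
        }
    }
    return NULL;
}

/* Start nthreads workers; return total page touches, or -1 on failure. */
long stress_threads(int nthreads, size_t pages_per_thread, long rounds)
{
    long page_size = sysconf(_SC_PAGESIZE);
    long total = 0;
    int started = 0;

    if (nthreads <= 0 || pages_per_thread == 0 || page_size <= 0)
        return -1;

    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct worker_arg *args = calloc((size_t)nthreads, sizeof *args);
    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1;
    }

    for (int i = 0; i < nthreads; i++) {
        args[i].buf = calloc(pages_per_thread, (size_t)page_size);
        args[i].pages = pages_per_thread;
        args[i].page_size = (size_t)page_size;
        args[i].rounds = rounds;
        if (args[i].buf == NULL ||
            pthread_create(&tids[i], NULL, worker, &args[i]) != 0) {
            free(args[i].buf);
            total = -1;           /* remember the failure, join the rest */
            break;
        }
        started++;
    }

    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (total >= 0)
            total += args[i].touched;
        free(args[i].buf);
    }

    free(tids);
    free(args);
    return total;
}
```

Called from a small main() as, say, stress_threads(8, 1024, 100000) and built with 'cc -pthread', it could be left running under ktrace while something else puts the machine under memory pressure. The thread count and buffer sizes are guesses and would need tuning per host.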
However, Chuck Silvers's post to port-macppc
<http://mail-index.netbsd.org/port-macppc/2005/02/03/0001.html> a year
and a half ago suggests that this problem is not limited to the sparc
ports. He refers to mycroft's 'libpthread hacks,' which have long
since been committed to the tree, and so appear in both the netbsd-3
branch and the trunk. (Some of his changes are wrapped in '#ifdef
PTHREAD_MLOCK_KLUDGE' and '#ifdef PTHREAD__DEBUG', so I was going to
rebuild libpthread with those defined, but the default build of
libpthread already defines PTHREAD_MLOCK_KLUDGE. And still, my
threaded processes die.)

So, not knowing where to start, I ran 'ktrace /usr/pkg/sbin/named -u
named -t /var/chroot/named -g'. It crashed some minutes later. The
last lines of the 'kdump -R' looked like this:

2257 5 named 0.000151492 CALL setcontext(0x2afff480)
2257 5 named 0.000043497 RET setcontext JUSTRETURN
2257 2 named 0.000906950 SAU blocked, event=[<ctx=0x24fffe40, id=2, cpu=0>]
2257 2 named 0.000124993 CALL setcontext(0x217ff800)
2257 2 named 0.000043997 RET setcontext JUSTRETURN
2257 2 named 0.000042498 CALL sa_yield
2257 2 named 0.003410810 SAU unblocked, event=[<ctx=0x24fffe40, id=2, cpu=0>], intr=[<ctx=0x2afff108, id=5, cpu=0>]
2257 2 named 0.000048497 RET sa_yield JUSTRETURN
2257 5 named 0.011603353 SAU blocked, event=[<ctx=0x257ffe40, id=5, cpu=0>]
2257 5 named 0.000243986 SAU unblocked, event=[<ctx=0x257ffe40, id=5, cpu=0>], intr=[<ctx=0x2affec50, id=2, cpu=0>]
2257 5 named 0.000135493 CALL setcontext(0x2affec50)
2257 5 named 0.000041998 RET setcontext JUSTRETURN
2257 5 named 0.000093994 PSIG SIGSEGV SIG_DFL
2257 3 named 0.000387979 RET select -1 errno 4 Interrupted system call
2257 1 named 0.000122493 RET __sigtimedwait -1 errno 87 Operation Canceled

A second time, named ran for over an hour, and then died with a
SIGBUS, rather than SIGSEGV:

18342 5 named 0.000153492 CALL setcontext(0x2afff5f0)
18342 5 named 0.000047997 RET setcontext JUSTRETURN
18342 2 named 0.000814954 SAU blocked, event=[<ctx=0x23fffe40, id=2, cpu=0>]
18342 2 named 0.000121994 CALL setcontext(0x217ff800)
18342 2 named 0.000046997 RET setcontext JUSTRETURN
18342 2 named 0.000040998 CALL sa_yield
18342 2 named 0.018910942 SAU unblocked, event=[<ctx=0x23fffe40, id=2, cpu=0>], intr=[<ctx=0x2afff1d0, id=5, cpu=0>]
18342 2 named 0.000054997 RET sa_yield JUSTRETURN
18342 5 named 0.029272363 SAU blocked, event=[<ctx=0x247ffe40, id=5, cpu=0>]
18342 5 named 0.000246986 SAU unblocked, event=[<ctx=0x247ffe40, id=5, cpu=0>], intr=[<ctx=0x2affed18, id=2, cpu=0>]
18342 5 named 0.000136992 CALL setcontext(0x2affed18)
18342 5 named 0.000046498 RET setcontext JUSTRETURN
18342 5 named 0.000059996 PSIG SIGBUS SIG_DFL
18342 3 named 0.021391804 RET select -1 errno 4 Interrupted system call
18342 1 named 0.000131993 RET __sigtimedwait -1 errno 87 Operation Canceled

How can I help solve this problem?
- Geoff