Subject: port-m68k/35099: pthread programs core on m68k
To: None <port-m68k-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: None <stix@stix.id.au>
List: netbsd-bugs
Date: 11/23/2006 07:10:00
>Number: 35099
>Category: port-m68k
>Synopsis: pthread programs core on m68k
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-m68k-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Nov 23 07:10:00 +0000 2006
>Originator: Paul Ripke
>Release: NetBSD 4.99.4 (-current 20061122ish)
>Organization:
>Environment:
System: NetBSD kitt.stix.org.au 4.99.4 NetBSD 4.99.4 (KITT) #0: Tue Nov 21 22:16:31 EST 2006 stix@zion.stix.org.au:/export/netbsd/current/obj.mac68k/export/netbsd/current/src/sys/arch/mac68k/compile/KITT mac68k
Architecture: m68k
Machine: mac68k
>Description:
Many pthread programs die with SIGILL after running for a while. The
failure appears to require more than one LWP (i.e. not just thread
switching in userspace). Since named(8) is now threaded, it regularly
dies with a SIGILL.
>How-To-Repeat:
Using "fblckgen" from http://stix.id.au/wiki/iotools as a simple-ish test
(it only has two threads, for starters):
ksh$ PTHREAD_DEBUGLOG=1 time ./fblckgen -ab 4k -c 0 | cat > /dev/null
time: Command terminated abnormally.
11.90 real 2.48 user 3.70 sys
The "cat" above is required to get NLWP>1. Unfortunately, gdb itself dumps
core while trying to analyse the core file:
ksh$ gdb fblckgen fblckgen.core
GNU gdb 5.3nb1
...
Core was generated by `fblckgen'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libpthread.so.0...done.
Loaded symbols for /usr/lib/libpthread.so.0
Reading symbols from /usr/lib/libc.so.12...done.
Loaded symbols for /usr/lib/libc.so.12
Reading symbols from /usr/libexec/ld.elf_so...done.
Loaded symbols for /usr/libexec/ld.elf_so
#0 0x049ffbe4 in ?? ()
(gdb) thr app all bt
Thread 3 (Thread 22 ()):
#0 0x04025284 in pthread__locked_switch () from /usr/lib/libpthread.so.0
#1 0x06bffb78 in ?? ()
Memory fault (core dumped)
The debuglog always ends the same (with different addresses):
ksh$ debuglog -k | tail -20
(up 0x4e00000) sigev val 88880020
(up 0x4e00000) switching to 0xffe00000 (uc: U 0xffffb200 pc: 4025284)
(recycle 0xffe00000) recycling 0x4e00000
(up 0x4e00000) type 5 LWP 2 ev 0 intr 1
(fi 0x4e00000) victim 2 0x6a00000(1) lockholder 1
(rl 0x4e00000) entered
(rl 0x4e00000) intqueue 0x6a00000
(rl 0x4e00000) victim 0x6a00000 (uc T 0x6bffb6c) normal spinlocks: 1
(rl 0x4e00000) starting chain 0x6a00000 (uc T 0x6bffb6c pc 4029d08 sp 6bfff6c)
(rl 0x4e00000) returned from chain
(rl 0x4e00000) intqueue 0x6a00000
(rl 0x4e00000) victim 0x6a00000 (uc U 0x6bffb78) normal heldlock: 0x6690 switchto: 0xffe00000 (uc 0xffffb200 pc 4025284)
(rl 0x4e00000) exiting
(up 0x4e00000) sigev val 88880020
(up 0x4e00000) switching to 0xffe00000 (uc: U 0xffffb200 pc: 4025284)
(recycle 0xffe00000) recycling 0x4e00000
(up 0x4e00000) type 2 LWP 3 ev 1 intr 0
(up 0x4e00000) blocker 2 0xffe00000(1)
(up 0x4e00000) switching to 0x6a00000 (uc: U 0x6bffb78 pc: 4025284)
(recycle 0x6a00000) recycling 0x4e00000
Previously, with what was tagged as netbsd-4 (before gcc4, etc.), gdb would
get the following out of the core:
Thread 3 (Thread 22 ()):
#0 0x04023174 in pthread__locked_switch () from /usr/lib/libpthread.so.0
#1 0x06bffb70 in ?? ()
#2 0x040283b2 in pthread_cond_wait () from /usr/lib/libpthread.so.0
#3 0x00003548 in makeBlocks (dummy=0x0) at fblckgen.c:234
#4 0x040296ec in pthread_create () from /usr/lib/libpthread.so.0
Thread 2 (LWP 1):
#0 0x040584c2 in write () from /usr/lib/libc.so.12
#1 0x04022fca in write () from /usr/lib/libpthread.so.0
#2 0x000031be in main (argc=65536, argv=0x0) at fblckgen.c:179
Thread 1 (LWP 2):
#0 0x049ffbe4 in ?? ()
#1 0x040283b2 in pthread_cond_wait () from /usr/lib/libpthread.so.0
#2 0x00003548 in makeBlocks (dummy=0x0) at fblckgen.c:234
#3 0x040296ec in pthread_create () from /usr/lib/libpthread.so.0
#0 0x049ffbe4 in ?? ()
This is odd: gdb lists three threads, but the process only has 2 pthreads.
The address 0x049ffbe4 appears to be bogus, and different cores all
feature a similar address.
I believe this problem is already known, but I couldn't find a PR
specifically for this issue.
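For anyone without fblckgen to hand, the following is a minimal sketch of a
program with the same shape as the backtraces above: one thread generates
blocks and sleeps in pthread_cond_wait(), while the other thread does a
write() that may block. It is not the fblckgen source, it has not been run on
m68k, and it may or may not trigger the SIGILL. The names make_blocks,
write_blocks and BLOCK_SIZE are made up for this illustration; only the
pthread_cond_wait()/write() call pattern and the 4k block size come from the
report.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096		/* assumption: mirrors "-b 4k" above */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static char block[BLOCK_SIZE];
static int block_ready;		/* 1 while block holds unwritten data */
static int done;		/* set by the writer to stop the generator */

/* Generator thread: fill the block, then sleep until it is consumed. */
static void *
make_blocks(void *arg)
{
	unsigned char fill = 0;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!done) {
		while (block_ready && !done)
			pthread_cond_wait(&cond, &lock);
		if (done)
			break;
		memset(block, fill++, sizeof(block));
		block_ready = 1;
		pthread_cond_signal(&cond);
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/*
 * Writer: write nblocks generated blocks to fd.
 * Returns the number of bytes written, or -1 on error.
 */
long
write_blocks(int fd, long nblocks)
{
	pthread_t tid;
	long i, total = 0;
	int failed = 0;

	assert(nblocks >= 0);
	block_ready = 0;
	done = 0;
	if (pthread_create(&tid, NULL, make_blocks, NULL) != 0)
		return -1;

	pthread_mutex_lock(&lock);
	for (i = 0; i < nblocks && !failed; i++) {
		size_t off = 0;

		while (!block_ready)
			pthread_cond_wait(&cond, &lock);
		/* The generator is asleep on cond here; write() may block. */
		while (off < sizeof(block)) {
			ssize_t n = write(fd, block + off, sizeof(block) - off);

			if (n < 0 && errno == EINTR)
				continue;
			if (n <= 0) {
				failed = 1;
				break;
			}
			off += (size_t)n;
		}
		total += (long)off;
		block_ready = 0;
		pthread_cond_signal(&cond);
	}
	done = 1;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(tid, NULL);
	return failed ? -1 : total;
}
```

Calling write_blocks(STDOUT_FILENO, n) from main() with a large n and piping
the output through cat would mirror the command line above, where the pipe
is what makes the write() block and gets NLWP>1.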
>Fix:
Unknown.
>Unformatted: