Subject: Re: After newlock2 merge: Different pthread behavior for userland programs?
To: Matthias Drochner <M.Drochner@fz-juelich.de>
From: Andrew Doran <ad@NetBSD.org>
List: current-users
Date: 04/14/2007 12:57:24
Hi,

On Thu, Apr 12, 2007 at 08:14:22PM +0200, Matthias Drochner wrote:
> ad@NetBSD.org said:
> > I've seen a similar trace recently from a FUSE app (pthread_spinlock),
> > I'll have a look in the next few days. Apparently it's not hard to
> > reproduce the problem.
>
> I was hit again, with today's kernel. With both CPUs enabled,
> and not running setiathome. As I said, I've never seen these problems
> when using just one CPU, or if I keep both CPUs busy.
>
> xfce-mcs-manager died at the same point - the assertion after a
> pthread cancel check. I didn't find a call to pthread_cancel
> in the glib sources, so I suspect that the check firing is
> already an indication of corruption.
>
> Program terminated with signal 6, Aborted.
> #0 0xbb31819f in kill () from /usr/lib/libc.so.12
> (gdb) where
> #0 0xbb31819f in kill () from /usr/lib/libc.so.12
> #1 0xbb3e01f7 in pthread__assertfunc () from /usr/lib/libpthread.so.0
> #2 0xbb3dedba in pthread_spinlock () from /usr/lib/libpthread.so.0
> #3 0xbb3e103d in pthread_exit () from /usr/lib/libpthread.so.0
> #4 0xbb3de804 in poll () from /usr/lib/libpthread.so.0
> #5 0xbb416caf in g_main_context_check () from /usr/pkg/lib/libglib-2.0.so.0
> (gdb) x/100i poll
> [...]
> 0xbb3de7d3 <poll+31>: mov 0x1c(%esi),%eax
> 0xbb3de7d6 <poll+34>: test %eax,%eax
> 0xbb3de7d8 <poll+36>: jne 0xbb3de7fa <poll+70>
> 0xbb3de7da <poll+38>: push %eax
> 0xbb3de7db <poll+39>: pushl 0x10(%ebp)
> 0xbb3de7de <poll+42>: pushl 0xc(%ebp)
> 0xbb3de7e1 <poll+45>: pushl 0x8(%ebp)
> 0xbb3de7e4 <poll+48>: call 0xbb3dbcc0 <_sys_poll@plt>
> 0xbb3de7e9 <poll+53>: add $0x10,%esp
> 0xbb3de7ec <poll+56>: mov 0x1c(%esi),%esi
> 0xbb3de7ef <poll+59>: test %esi,%esi
> 0xbb3de7f1 <poll+61>: jne 0xbb3de7fa <poll+70>
> 0xbb3de7f3 <poll+63>: lea 0xfffffff8(%ebp),%esp
> 0xbb3de7f6 <poll+66>: pop %ebx
> 0xbb3de7f7 <poll+67>: pop %esi
> 0xbb3de7f8 <poll+68>: leave
> 0xbb3de7f9 <poll+69>: ret
> 0xbb3de7fa <poll+70>: sub $0xc,%esp
> 0xbb3de7fd <poll+73>: push $0x1
> 0xbb3de7ff <poll+75>: call 0xbb3dbae0 <pthread_exit@plt>
> 0xbb3de804 <open>: push %ebp

The assertion suggests that pthread_self() is returning junk.

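To make the control flow of the poll() disassembly above explicit, here is a minimal C sketch of the cancellation-point pattern it appears to follow: test a per-thread cancel flag (the load from 0x1c(%esi)), make the system call, test the flag again, and call pthread_exit() if either test sees it set. The struct, its field and the function name are hypothetical stand-ins, not libpthread's real internals, and the constant 1 pushed before the pthread_exit call is presumed to be PTHREAD_CANCELED.

```c
#include <poll.h>
#include <pthread.h>

/*
 * Hypothetical stand-in for the per-thread structure; the flag plays
 * the role of whatever lives at 0x1c(%esi) in the disassembly.
 */
struct sketch_thread {
	volatile int cancel_pending;	/* hypothetical field name */
};

/*
 * Sketch of a cancellation point wrapped around poll(): check the
 * cancel flag, make the call, then check the flag again.  If either
 * check sees a nonzero flag, the thread exits with PTHREAD_CANCELED
 * and the wrapper never returns.
 */
int
sketch_poll_cancelpoint(struct sketch_thread *self, struct pollfd *fds,
    nfds_t nfds, int timeout)
{
	int retval;

	if (self->cancel_pending)		/* test before the call */
		pthread_exit(PTHREAD_CANCELED);
	retval = poll(fds, nfds, timeout);
	if (self->cancel_pending)		/* test after the call */
		pthread_exit(PTHREAD_CANCELED);
	return retval;
}
```

With a valid self and a clear flag, sketch_poll_cancelpoint(&t, NULL, 0, 0) simply returns poll()'s result (0 here). If self points at garbage, either flag test can read a nonzero word and take the exit path even though nothing called pthread_cancel(), which fits the xfce-mcs-manager trace.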
> When I tried to rebuild userland, /bin/sh died unexpectedly in
> a way which looks impossible:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0x0805aadc in setvar ()
> (gdb) where
> #0 0x0805aadc in setvar ()
> #1 0x08055d51 in readcmd ()
> #2 0x0804c594 in evalcommand ()
> #3 0x0804ba6c in evaltree ()
> #4 0x0804cfe5 in evalloop ()
> #5 0x0804bae8 in evaltree ()
> #6 0x0804cc19 in evalpipe ()
> #7 0x0804ba5a in evaltree ()
> #8 0x0804ba1d in evaltree ()
> #9 0x0804d0ba in evalstring ()
> #10 0x08054f26 in main ()
> (gdb) x/i setvar
> [...]
> 0x805aad9 <setvar+57>: lea 0x1(%esi),%ecx
> (gdb)
> 0x805aadc <setvar+60>: mov (%ecx),%dl
> (gdb) info reg
> eax 0x0 0
> ecx 0x806c000 134660096
> edx 0x8069e00 134651392
> ebx 0xbbbb3c00 -1145357312
> esp 0xbfbfdd20 0xbfbfdd20
> ebp 0xbfbfdd38 0xbfbfdd38
> esi 0x8069ec4 134651588
> edi 0x1 1
> eip 0x805aadc 0x805aadc <setvar+60>
> eflags 0x10216 [ PF AF IF RF ]
> cs 0x17 23
> ss 0x1f 31
> ds 0x1f 31
> es 0x1f 31
> fs 0x1f 31
> gs 0x1f 31
> (gdb) x/x 0x8069ec4
> 0x8069ec4: 0x69667a74
> (gdb) x/x 0x806c000
> 0x806c000: Cannot access memory at address 0x806c000
>
>
> As you see, either esi or ecx must be wrong here.
> It might be a strange coincidence that the xfce crash can
> be explained by a corruption of esi...
>
> I've kept the coredumps and binaries, in case someone
> wants to analyze them further.

Would you be willing to put these up somewhere I can take a look?

Andrew