current-users: Re: After newlock2 merge: Different pthread behavior for userland programs?

Subject: Re: After newlock2 merge: Different pthread behavior for userland programs?
To: Matthias Drochner <M.Drochner@fz-juelich.de>
From: Andrew Doran <ad@NetBSD.org>
List: current-users
Date: 04/14/2007 12:57:24

Hi,

On Thu, Apr 12, 2007 at 08:14:22PM +0200, Matthias Drochner wrote:

> ad@NetBSD.org said:
> > I've seen a similar trace recently from a FUSE app (pthread_spinlock),
> > I'll have a look in the next few days.  Apparently it's not hard to
> > reproduce the problem.
> 
> I was hit again, with today's kernel. With both CPUs enabled,
> and not running setiathome. As said, I've never seen these problems
> if using just one CPU, or if I keep both CPUs busy.
> 
> xfce-mcs-manager died at the same point - the assertion after a
> pthread cancel check. I didn't find a call to pthread_cancel
> in the glib sources, so I suspect that the check firing is
> already an indication of corruption.
> 
> Program terminated with signal 6, Aborted.
> #0  0xbb31819f in kill () from /usr/lib/libc.so.12
> (gdb) where
> #0  0xbb31819f in kill () from /usr/lib/libc.so.12
> #1  0xbb3e01f7 in pthread__assertfunc () from /usr/lib/libpthread.so.0
> #2  0xbb3dedba in pthread_spinlock () from /usr/lib/libpthread.so.0
> #3  0xbb3e103d in pthread_exit () from /usr/lib/libpthread.so.0
> #4  0xbb3de804 in poll () from /usr/lib/libpthread.so.0
> #5  0xbb416caf in g_main_context_check () from /usr/pkg/lib/libglib-2.0.so.0
> (gdb) x/100i poll
> [...]
> 0xbb3de7d3 <poll+31>:   mov    0x1c(%esi),%eax
> 0xbb3de7d6 <poll+34>:   test   %eax,%eax
> 0xbb3de7d8 <poll+36>:   jne    0xbb3de7fa <poll+70>
> 0xbb3de7da <poll+38>:   push   %eax
> 0xbb3de7db <poll+39>:   pushl  0x10(%ebp)
> 0xbb3de7de <poll+42>:   pushl  0xc(%ebp)
> 0xbb3de7e1 <poll+45>:   pushl  0x8(%ebp)
> 0xbb3de7e4 <poll+48>:   call   0xbb3dbcc0 <_sys_poll@plt>
> 0xbb3de7e9 <poll+53>:   add    $0x10,%esp
> 0xbb3de7ec <poll+56>:   mov    0x1c(%esi),%esi
> 0xbb3de7ef <poll+59>:   test   %esi,%esi
> 0xbb3de7f1 <poll+61>:   jne    0xbb3de7fa <poll+70>
> 0xbb3de7f3 <poll+63>:   lea    0xfffffff8(%ebp),%esp
> 0xbb3de7f6 <poll+66>:   pop    %ebx
> 0xbb3de7f7 <poll+67>:   pop    %esi
> 0xbb3de7f8 <poll+68>:   leave  
> 0xbb3de7f9 <poll+69>:   ret    
> 0xbb3de7fa <poll+70>:   sub    $0xc,%esp
> 0xbb3de7fd <poll+73>:   push   $0x1
> 0xbb3de7ff <poll+75>:   call   0xbb3dbae0 <pthread_exit@plt>
> 0xbb3de804 <open>:      push   %ebp

The assertion suggests that pthread_self() is returning junk.
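
For anyone following the quoted listing, here is a rough toy model of the control flow it shows: a word at offset 0x1c from the pointer in %esi is tested before and after the system call, and a non-zero value sends execution to pthread_exit with the argument 1. This is only a sketch in Python, not the libpthread source. The names CANCEL_OFFSET, ThreadExit, cancel_word and poll_wrapper are made up for illustration, and the 0xA5 fill pattern is an arbitrary stand-in for "junk".

```python
# Toy model of the control flow in the quoted poll() disassembly.
# All names here are hypothetical; only the offsets come from the listing.
CANCEL_OFFSET = 0x1C   # offset read by `mov 0x1c(%esi),%eax`
EXIT_VALUE = 0x1       # value pushed before `call pthread_exit@plt`


class ThreadExit(Exception):
    """Stands in for pthread_exit(), which never returns."""


def cancel_word(thread_mem):
    """Read the 32-bit little-endian word at CANCEL_OFFSET."""
    return int.from_bytes(thread_mem[CANCEL_OFFSET:CANCEL_OFFSET + 4], "little")


def poll_wrapper(thread_mem, sys_poll):
    """Check the word, make the system call, then check the word again."""
    if cancel_word(thread_mem) != 0:      # poll+31 .. poll+36
        raise ThreadExit(EXIT_VALUE)
    result = sys_poll()                   # poll+48
    if cancel_word(thread_mem) != 0:      # poll+56 .. poll+61
        raise ThreadExit(EXIT_VALUE)
    return result                         # poll+63 .. poll+69


sane = bytes(0x20)             # all zero: nothing pending
junk = bytes([0xA5]) * 0x20    # arbitrary garbage standing in for a bad pointer

print(poll_wrapper(sane, lambda: 0))
try:
    poll_wrapper(junk, lambda: 0)
except ThreadExit as exc:
    print("pthread_exit path taken with", exc.args[0])
```

In this model, a pointer that refers to garbage is enough to reach the pthread_exit path without any thread ever having requested cancellation.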

> When I tried to rebuild userland, /bin/sh died unexpectedly in
> a way which looks impossible:
> 
> Program terminated with signal 11, Segmentation fault.
> #0  0x0805aadc in setvar ()
> (gdb) where
> #0  0x0805aadc in setvar ()
> #1  0x08055d51 in readcmd ()
> #2  0x0804c594 in evalcommand ()
> #3  0x0804ba6c in evaltree ()
> #4  0x0804cfe5 in evalloop ()
> #5  0x0804bae8 in evaltree ()
> #6  0x0804cc19 in evalpipe ()
> #7  0x0804ba5a in evaltree ()
> #8  0x0804ba1d in evaltree ()
> #9  0x0804d0ba in evalstring ()
> #10 0x08054f26 in main ()
> (gdb) x/i setvar
> [...]
> 0x805aad9 <setvar+57>:  lea    0x1(%esi),%ecx
> (gdb) 
> 0x805aadc <setvar+60>:  mov    (%ecx),%dl
> (gdb) info reg
> eax            0x0      0
> ecx            0x806c000        134660096
> edx            0x8069e00        134651392
> ebx            0xbbbb3c00       -1145357312
> esp            0xbfbfdd20       0xbfbfdd20
> ebp            0xbfbfdd38       0xbfbfdd38
> esi            0x8069ec4        134651588
> edi            0x1      1
> eip            0x805aadc        0x805aadc <setvar+60>
> eflags         0x10216  [ PF AF IF RF ]
> cs             0x17     23
> ss             0x1f     31
> ds             0x1f     31
> es             0x1f     31
> fs             0x1f     31
> gs             0x1f     31
> (gdb) x/x 0x8069ec4
> 0x8069ec4:      0x69667a74
> (gdb) x/x 0x806c000
> 0x806c000:      Cannot access memory at address 0x806c000
> 
> 
> As you see, either esi or ecx must be wrong here.
> It might be a strange coincidence that the xfce crash can
> be explained by a corruption of esi...
> 
> I've kept the coredumps and binaries, in case someone
> wants to analyze them further.
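
As a rough sanity check of the register values quoted above, the arithmetic can be written out. This is a sketch in Python that assumes the `lea 0x1(%esi),%ecx` at setvar+57 ran immediately before the faulting `mov (%ecx),%dl` and that nothing else changed %ecx in between (for instance, that the load is not the target of a loop that advances %ecx). The 4096-byte page size is also an assumption about this i386 system.

```python
PAGE_SIZE = 4096                 # assumption: i386 page size

esi = 0x8069EC4                  # from `info reg`
ecx = 0x806C000                  # from `info reg`; also the faulting address

# Effect of `lea 0x1(%esi),%ecx` in 32-bit arithmetic.
expected_ecx = (esi + 1) & 0xFFFFFFFF
gap = ecx - expected_ecx

print(hex(expected_ecx))         # 0x8069ec5
print(hex(gap))                  # 0x213b
print(ecx % PAGE_SIZE == 0)      # True: ecx sits exactly on a page boundary

# The word gdb printed at 0x8069ec4, viewed as little-endian bytes.
word = 0x69667A74
print(word.to_bytes(4, "little"))  # b'tzfi'
```

Under those assumptions the two registers disagree by 0x213b bytes, and %ecx points at the first byte of a page that gdb reports as inaccessible.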

Would you be willing to put these up somewhere I can take a look?

Andrew