Subject: Re: 2.0 for sgimips broken
To: Simon Burge <simonb@wasabisystems.com>
From: Rafal Boni <rafal@pobox.com>
List: port-sgimips
Date: 05/13/2004 22:21:43
In message <20040512082626.14FB723410@thoreau.thistledown.com.au>, you write: 

-> Christopher SEKIYA wrote:
-> 
-> > I think I've isolated the commit that broke it:
-> > 	
-> > 	Module Name:    src
-> > 	Committed By:   simonb
-> > 	Date:           Tue Mar 23 02:21:49 UTC 2004
-> > 
-> > 	Modified Files:
-> > 	        src/lib/libc/arch/mips/gen: __setjmp14.S
-> > 	Added Files:
-> > 	        src/lib/libc/arch/mips/gen: __longjmp14.c
-> > 
-> > 	Log Message:
-> > 	Use setcontext() instead of sigreturn() to implement longjmp().
-> > 
-> > 	cvs rdiff -r0 -r1.1 src/lib/libc/arch/mips/gen/__longjmp14.c
-> > 	cvs rdiff -r1.9 -r1.10 src/lib/libc/arch/mips/gen/__setjmp14.S
-> > 
-> > ... at least, anything built after that commit exhibits the cache panic.
-> 
-> Yay for me :-(  Are any of the regression tests in regress/lib/libc/*setjmp*
-> able to reproduce the panic, and if so is it easy to backtrack the kernel
-> to find out when that end of the problem first started occurring?

I haven't had a chance to do this yet (maybe Chris will beat me to it :-)
since the O2's disk is too small to hold a whole source tree plus even a
minimal 2.0 chroot (and I didn't want to totally zorch my userland).

I did try two trivial setjmp/longjmp and getcontext/setcontext examples
and neither exhibited any issues.
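
For reference, a minimal sketch of the kind of round-trip tests meant here.
These are not the exact programs that were run; the function names
check_setjmp_longjmp() and check_getcontext_setcontext() are made up for
illustration.

```c
#include <setjmp.h>
#include <ucontext.h>

/*
 * Hypothetical helper: save a context with setjmp(), jump back to it
 * with longjmp(), and confirm we came back exactly once.
 * Returns 0 on success, -1 otherwise.
 */
int
check_setjmp_longjmp(void)
{
	jmp_buf env;
	volatile int jumped = 0;	/* volatile: must survive the longjmp */

	if (setjmp(env) != 0) {
		/* Second return, reached via longjmp(). */
		return jumped == 1 ? 0 : -1;
	}
	jumped = 1;
	longjmp(env, 1);
	return -1;			/* not reached */
}

/*
 * Hypothetical helper: the same round trip using getcontext() and
 * setcontext().  Returns 0 on success, -1 otherwise.
 */
int
check_getcontext_setcontext(void)
{
	ucontext_t ctx;
	volatile int resumed = 0;	/* volatile: re-read after resuming */

	if (getcontext(&ctx) == -1)
		return -1;
	if (resumed == 0) {
		resumed = 1;
		setcontext(&ctx);	/* resumes just after getcontext() */
		return -1;		/* setcontext() returns only on failure */
	}
	return 0;
}
```

Calling both from main() and printing the results is enough; on a healthy
system each should return 0.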

-> I've just tried a fresh 2.0 branch build on a little-endian "sbmips"
-> board, and this is what I see:
-> 
-> 	NetBSD 2.0_BETA (GENERIC) #0: Tue May 11 12:49:29 EST 2004
-> 
-> 	Welcome to NetBSD!
-> 
-> 	pid 280 (csh), uid 0: exited on signal 11 (core dumped)
-> 	Badly placed ()'s.
-> 	rhone# 

Hmm.  I'm running a 2.0E kernel on my O2 and an older (1.6ZK timeframe,
IIRC) userland.  I set up a minimal 2.0_BETA chroot, and see the following:

	
	toaster-ex# chroot 2.0-chroot /bin/csh -l
	panic: cache error @ EPC 0x8036445c ErrCtl 0x0 CacheErr 0xa00f6702
	panic: cache error @ EPC 0x80247544 ErrCtl 0x0 CacheErr 0xa011d21b
	Stopped at      netbsd:cpu_Debugger+0x4:        jr      ra
                bdslot: nop
	db> 

	(however, without the -l it starts fine and mostly works)

-> but Manuel's "ll" test just wedges this box, such that I can't even get
-> in to ddb:
-> 
-> 	rhone# alias ll ls -lgF
-> 	rhone# ll /lib/libc.so*
-> 	[ hang ]

Hmm, on sgimips this also produces the cache errors (I set up the alias
from memory so my "ll" alias was set to "ls -laF" instead), even if I 
skip the args to "ll".

Interestingly, if I attach gdb to the csh process and set a breakpoint
at __vfork14, I hit the breakpoint in gdb and when I continue get the
output I expected.  Makes me almost wonder if this isn't somehow related
to the runtime linker or maybe zorched relocs in libc.  Or maybe some
key bit of code *is* actually missing a cache flush and the stop in
gdb takes care of it.  I'm still not sure if it actually means or points
to anything specific, but it was odd enough that it was worth reporting.
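
One way to take csh out of the picture would be a standalone vfork-and-exec
test.  The following is only a sketch under the assumption that the vfork
path is involved at all, which is not established; run_via_vfork() is a
made-up name and nobody has reported running this.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` with `argv` in a vfork()ed child and
 * return its exit status, or -1 if the child could not be created,
 * could not be waited for, or did not exit normally.
 */
int
run_via_vfork(const char *path, char *const argv[])
{
	int status;
	pid_t pid;

	pid = vfork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: only exec or _exit() is safe after vfork(). */
		execv(path, argv);
		_exit(127);		/* exec failed */
	}
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing "/bin/sh" with the argument vector { "sh", "-c",
"exit 3", NULL } should return 3 if the vfork/exec/wait path is healthy.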

(I was trying to figure out which syscall killed it and ktruss'ing it 
 didn't help since the panic was emitted before the ktruss record; so
 I figured I'd ktrace a working binary and then try to figure out which
 syscall made it go gonzo by stopping at each...)

-> I agree that userland being able to hang the kernel is definitely not
-> a good thing, but I can't recall what could have changed in -current
-> to fix the problem.  I'll look in to this further...

Yah, I think we all agree this is bad... Though I'm glad it's not just
specific to port-sgimips in some sense.  I wonder why other mips ports
aren't reporting this... 

Hopefully you'll find something, or the rest of us beating our heads on
it will do more than get headaches ;-)

Thanks,
--rafal

----
Rafal Boni                                                     rafal@pobox.com
  We are all worms.  But I do believe I am a glowworm.  -- Winston Churchill