Subject: anyone know if there's a fix for this "malloc with held simple_lock" in RAIDframe bug yet?
To: NetBSD/alpha Discussion List <port-alpha@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: port-alpha
Date: 03/14/2005 00:30:14
I was just trying to set up a RAID-1 mirror of the root drives on an
AlphaServer (with a 1.6.x MP kernel), and the instant I ran "raidctl -C"
the following spewed forth on the console:

malloc with held simple_lock 0xfffffc00006135c0 CPU 1 /building/work/woods/m-NetBSD-1.6/sys/dev/raidframe/rf_driver.c:356
alpha trace requires known PC =eject=

CPU 1: fatal kernel trap:

CPU 1    trap entry = 0x3 (instruction fault)
CPU 1    a0         = 0x1
CPU 1    a1         = 0xfffffc000054ef64
CPU 1    a2         = 0x324
CPU 1    pc         = 0xfffffc000051fa64
CPU 1    ra         = 0xfffffc00003ea388
CPU 1    pv         = 0xfffffc000051fa60
CPU 1    curproc    = 0xfffffc0090710b90
CPU 1        pid = 4224, comm = raidctl

panic: trap


Apparently my kernel is also going into an infinite "panic: trap" loop
when it tries to enter DDB, and I have to hit the reset button to get
control.  Maybe DDB is not starting on the right CPU or some such
weirdness?

Ah, finally, here's a proper trace from another crash that happened
during an attempt to reboot back to multiuser before moving
/etc/raid0.conf out of the way (I guess the panic doesn't always
loop)....


RAIDFRAME: protectedSectors is 64

malloc with held simple_lock 0xfffffc00006135c0 CPU 2 /building/work/woods/m-NetBSD-1.6/sys/dev/raidframe/rf_driver.c:356
alpha trace requires known PC =eject=
Stopped in pid 23 (raidctl) at  cpu_Debugger+0x4:       ret     zero,(ra)
db{2}> trace
cpu_Debugger() at cpu_Debugger+0x4
simple_lock_only_held() at simple_lock_only_held+0x148
malloc() at malloc+0x90
rf_ConfigureMapModule() at rf_ConfigureMapModule+0x4c
rf_Configure() at rf_Configure+0x2ac
raidioctl() at raidioctl+0xa20
spec_ioctl() at spec_ioctl+0xec
vn_ioctl() at vn_ioctl+0x154
sys_ioctl() at sys_ioctl+0x4ac
syscall_plain() at syscall_plain+0x158
XentSys() at XentSys+0x5c
--- syscall (54) ---
--- user mode ---
db{2}> machine slock 0xfffffc00006135c0
lock_data=1 holder=2
 last locked=/building/work/woods/m-NetBSD-1.6/sys/dev/raidframe/rf_driver.c:356

 last unlocked=(null):0

db{2}> 


I guess I'm not going to be mirroring the system disk just yet.....


In any case, does anyone know if there's already a fix in -current for
this "malloc with held simple_lock" bug, and if so where I might hope
to find it?

The only difference I see in -current in that area is that the
RF_LOCK_MUTEX(configureMutex) call has been changed to
RF_LOCK_LKMGR_MUTEX(configureMutex).  That may be the fix according to
the comment associated with the change (rev. 1.55, and yes my kernel
does also use LOCKDEBUG), but I'd like to get some confirmation before I
try pulling that change into my own sources.
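
To illustrate the kind of check that seems to be firing here, the
following is a minimal userland sketch, not kernel code.  It assumes
that the debug check complains when an allocation that may sleep is
made while a spin-type lock is held, and that a lock which is allowed
to sleep does not trip it.  Every name in it (spin_lock, sleep_lock,
checked_alloc, and the two configure_* functions) is hypothetical and
is not taken from the NetBSD or RAIDframe sources.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userland model; none of these are NetBSD kernel APIs. */

struct spin_lock  { int locked; };   /* stands in for a spin-type lock */
struct sleep_lock { int locked; };   /* stands in for a sleepable lock */

static int spin_locks_held = 0;      /* spin locks currently held      */
static int violations = 0;           /* diagnostics emitted so far     */

void spin_acquire(struct spin_lock *l)
{
    l->locked = 1;
    spin_locks_held++;
}

void spin_release(struct spin_lock *l)
{
    assert(spin_locks_held > 0);     /* releasing a lock we never took */
    l->locked = 0;
    spin_locks_held--;
}

void sleep_acquire(struct sleep_lock *l) { l->locked = 1; }
void sleep_release(struct sleep_lock *l) { l->locked = 0; }

/* Allocator wrapper: complain if the caller may sleep while spinning. */
void *checked_alloc(size_t n, int may_sleep)
{
    if (may_sleep && spin_locks_held > 0) {
        violations++;
        fprintf(stderr, "alloc with held spin lock\n");
    }
    return malloc(n);
}

/* Pattern that trips the check: allocate under a spin lock. */
int configure_with_spin_lock(void)
{
    struct spin_lock mtx = { 0 };
    int before = violations;

    spin_acquire(&mtx);
    void *p = checked_alloc(64, 1);
    spin_release(&mtx);
    free(p);
    return violations - before;      /* diagnostics raised by this call */
}

/* Same work, but serialized with a lock that is allowed to sleep. */
int configure_with_sleep_lock(void)
{
    struct sleep_lock mtx = { 0 };
    int before = violations;

    sleep_acquire(&mtx);
    void *p = checked_alloc(64, 1);
    sleep_release(&mtx);
    free(p);
    return violations - before;
}
```

In this model, calling configure_with_spin_lock() raises one
diagnostic and configure_with_sleep_lock() raises none, which is the
behaviour I would hope the -current change gives in the real kernel,
though that is exactly the part I'd like confirmed.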

-- 
						Greg A. Woods

H:+1 416 218-0098  W:+1 416 489-5852 x122  VE3TCP  RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>