At Wed, 18 Jan 2012 08:48:30 -0500, Thor Lancelot Simon <tls%panix.com@localhost> wrote: Subject: Re: ongoing major problems with NetBSD-5 and LOCKDEBUG on multi-core system (mfi(4) related?) > > I don't think either of these is the right way. > > Unfortunately, I don't know enough about our storage drivers these days > to recommend one as a good example for you to look at; but I think it > would be a much better idea to look at another storage driver and see > how the locking works than to copy OpenBSD or throw KERNEL_LOCK() > calls around, which is what the changes in the source tree seem to > have done. Indeed, though now that I've removed the possibly errant uses of KERNEL_LOCK() in mfi(4), even if incorrectly, and I'm still getting LOCKDEBUG panics, I'm beginning to think I should be looking elsewhere for the cause of the main trouble I'm seeing. > What is being protected by those calls? The driver, from another kernel > subsystem? Or the driver, from itself? If the latter, it would be much > better to think about the driver's actual locking needs and add or adjust > locks accordingly. I _think_ it's the former -- the initial change adding KERNEL_LOCK() and splbio() calls was around code which calls an internal function to grab sensor data on behalf of sysmon_envsys(9): ---------------------------- revision 1.28 date: 2009/08/26 21:41:05; author: bouyer; state: Exp; lines: +8 -5 mfi.c still uses the spl() synchronisation scheme and so needs the kernel lock. The sysmon subsystem is marked MPSAFE and so runs without the kernel lock. So get the kernel lock in mfi_sensor_refresh() before calling mfi_ioctl_vol(). This fixes command list corruption seen on heavy I/O load on the mfi driver(4). ---------------------------- The second change was to wrap up all the calls to functions inside the main ioctl() routine, because bio(4) can call it too: ---------------------------- revision 1.31 date: 2010/01/19 20:54:32; author: bouyer; state: Exp; lines: +7 -3 branches: 1.31.2; bio(4) is MP-safe but mfi(4) is not. So get the kernel_lock at mfi_ioctl() entry and release it on exit. ---------------------------- The problem is all of these ioctl() internal functions basically post a command to the hardware then call tsleep() waiting for it to complete. So if I understand correctly these calls from other subsystems are not in an interrupt context and if I remember correctly tsleep() should be fine in any normal driver ioctl() function (for a non-MPSAFE driver, if indeed its ioctl() function was in the device switch table), but clearly grabbing the KERNEL_LOCK() is the wrong thing to do for this case, at least without releasing it properly before the tsleep() call. It seems like bio(4) and sysmon_envsys(9) might be a little aggressive in their assumptions too -- perhaps they should not assume the device callbacks they use are in MPSAFE parts of the system. I think _they_ should know to use a mutex to prevent overlapping calls in these cases. Either that or they should be removed from drivers which are not MPSAFE. The splbio()/splx() calls I added from OpenBSD seem quite reasonable and correct, at least to me :-), but they don't solve the problem with sysmon_envsys(9) and bio(4) calling into a non-MPSAFE driver. I don't suppose the OpenBSD driver has been made MPSAFE yet, and in a way I could copy to NetBSD.... It does have the CCB mutex.... Obviously I'll have to do something about this before I go on: 20:29 [8] $ /usr/sbin/envstat -r load: 0.00 cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k ^?load: 0.00 cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k ^Zload: 0.00 cmd: envstat 3421 [tstile] 0.00u 0.00s 0% 116k (a "kill -9 3421" from root is no help either :-)) -- Greg A. Woods Planix, Inc. <woods%planix.com@localhost> +1 250 762-7675 http://www.planix.com/
Attachment:
pgpe3lzmPbrsp.pgp
Description: PGP signature