Subject: kern/35769: aac(4) needs updating from FreeBSD or replacing from OpenBSD (or fixing of a lost interrupt bug)
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Greg A. Woods <woods@planix.com>
List: netbsd-bugs
Date: 02/21/2007 20:55:02
>Number: 35769
>Category: kern
>Synopsis: aac(4) needs updating from FreeBSD or replacing from OpenBSD (or fixing of a lost interrupt bug)
>Confidential: yes
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Feb 21 20:55:02 +0000 2007
>Originator: Greg A. Woods
>Release: NetBSD netbsd-4
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 4.0_BETA2
Architecture: i386
Machine: i386 (Dell PE2650 with embedded PERC 3/Di RAID and add-on HP NetRAID-4M controllers)
>Description:
(I think this is a "critical" bug (at least w.r.t. the "freezing"
effect) since it no doubt could affect anyone and everyone using
aac(4), but I'm not hung up on it being "critical" since the
work-around does seem to be quite effective.)
I've been experimenting with the netbsd-4 branch on a dual-CPU Dell
PowerEdge 2650 (my new-to-me build server) with its embedded Dell PERC
3/Di RAID controller, and an add-in HP NetRAID-4M controller for some
external disks. Both of course are detected and managed by the aac(4)
driver.
The install was done from a build of the netbsd-4 branch that I did on a
host running netbsd-3. There are no local kernel changes other than to
the text of a few of the error messages in sys/dev/ic/pckbc.c. The
GENERIC.MPACPI kernel is the same config, verbatim, from the official
tree.
I've been attempting to rebuild the netbsd-4 branch on the new machine.
However I've been experiencing frequent, but temporary, filesystem
lockups, typically during bursts of I/O to and from the /build
filesystem, such as when linking or installing a program. The process
that gets stuck seems to be in the getblk state:
I've worked around the problem for now with the following line added to
the bottom of /etc/rc.local:
nohup sh -c 'while : ; do dd if=/dev/rld0d of=/dev/null count=1; sleep 1; done' &
This does an IO on the stuck device once per second, restarting it, and
it seems to keep everything moving along reasonably smoothly.
I did capture some not-terribly-interesting kernel info from DDB though
and all you basically see is "ld" stuck in getblk state:
[Mon Feb 19 22:50:28 2007]login: [-- break #1 sent -- `\z' -- Mon Feb 19 22:54:51 2007]
Stopped at netbsd:cpu_Debugger+0x4: popl %ebp
[Mon Feb 19 22:54:51 2007]db{0}> ps
[Mon Feb 19 22:54:53 2007] PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT
[Mon Feb 19 22:54:53 2007] 3920 18420 10164 1000 2 0x4002 1 ld getblk
[Mon Feb 19 22:54:53 2007] 18420 16402 10164 1000 2 0x4002 1 collect2 wait
[Mon Feb 19 22:54:53 2007] 16402 3137 10164 1000 2 0x4002 1 i386--netbsdelf- wait
[Mon Feb 19 22:54:53 2007] 3137 9416 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 9416 13265 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 13265 3034 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 3034 15061 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 15061 2593 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 2593 27295 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 27295 15819 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 18596 6309 18596 1000 2 0x4002 1 tail kqread
[Mon Feb 19 22:54:53 2007] 6309 20011 6309 1000 2 0x4002 1 ksh pause
[Mon Feb 19 22:54:53 2007] 20011 763 763 0 2 0x4100 1 rlogind poll
[Mon Feb 19 22:54:53 2007] 15819 10280 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 10280 28521 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 28521 27102 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 27102 2473 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 12119 621 12119 1000 2 0x4102 1 systat ttyin
[Mon Feb 19 22:54:53 2007] 2473 9639 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 9639 10936 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 10936 10384 10164 1000 2 0x4002 1 nbmake wait
[Mon Feb 19 22:54:53 2007] 10384 10164 10164 1000 2 0x4002 1 sh wait
[Mon Feb 19 22:54:53 2007] 10164 621 10164 1000 2 0x4002 1 ksh pause
[Mon Feb 19 22:54:53 2007] 621 620 621 1000 2 0x4003 1 ksh pause
[Mon Feb 19 22:54:54 2007] 620 763 763 0 2 0x4100 1 rlogind poll
[Mon Feb 19 22:54:54 2007] 695 1 695 0 2 0x4002 1 getty ttyin
[Mon Feb 19 22:54:54 2007] 826 1 826 0 2 0x4002 1 getty ttyin
[Mon Feb 19 22:54:54 2007] 782 1 782 0 2 0x4002 1 getty ttyin
[Mon Feb 19 22:54:54 2007]--db_more--
[Mon Feb 19 22:55:01 2007]db{0}> bt
[Mon Feb 19 22:55:08 2007]cpu_Debugger(c5145200,0,cea02f38,c0334f70,c5145260) at netbsd:cpu_Debugger+0x4
[Mon Feb 19 22:55:08 2007]comintr(c5145200,0,cea00010,30,10) at netbsd:comintr+0x6e0
[Mon Feb 19 22:55:08 2007]Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0x9c
[Mon Feb 19 22:55:08 2007]--- interrupt ---
[Mon Feb 19 22:55:08 2007]netbsd:cpu_switch+0xab:
[Mon Feb 19 22:55:46 2007]db{0}> machine cpu 1
[Mon Feb 19 22:55:55 2007]using CPU 1
[Mon Feb 19 22:55:55 2007]db{0}> bt
[Mon Feb 19 22:55:58 2007]netbsd:cpu_switch+0xab:
[Mon Feb 19 22:55:58 2007]db{0}> machine cpu 2
[Mon Feb 19 22:56:04 2007]using CPU 2
[Mon Feb 19 22:56:04 2007]db{0}> bt
[Mon Feb 19 22:56:05 2007]netbsd:cpu_switch+0xab:
[Mon Feb 19 22:56:05 2007]db{0}> machine cpu 3
[Mon Feb 19 22:56:08 2007]using CPU 3
[Mon Feb 19 22:56:08 2007]db{0}> bt
[Mon Feb 19 22:56:09 2007]netbsd:cpu_switch+0xab:
[Mon Feb 19 22:56:09 2007]db{0}> machine cpu 0
[Mon Feb 19 22:56:14 2007]using CPU 0
[Mon Feb 19 22:56:14 2007]db{0}> bt
[Mon Feb 19 22:56:17 2007]cpu_Debugger(c5145200,0,cea02f38,c0334f70,c5145260) at netbsd:cpu_Debugger+0x4
[Mon Feb 19 22:56:17 2007]comintr(c5145200,0,cea00010,30,10) at netbsd:comintr+0x6e0
[Mon Feb 19 22:56:17 2007]Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0x9c
[Mon Feb 19 22:56:17 2007]--- interrupt ---
[Mon Feb 19 22:56:17 2007]netbsd:cpu_switch+0xab:
[Mon Feb 19 22:56:17 2007]db{0}> cont
[Mon Feb 19 22:57:03 2007]
[Mon Feb 19 22:57:08 2007]
[Mon Feb 19 22:57:08 2007]NetBSD/i386 (once) (constty)
[Mon Feb 19 22:57:08 2007]
[Mon Feb 19 22:57:08 2007]login:
It looks more and more that the big culprit is the extreme age and
bit-rot in the aac(4) driver is at fault. There have been many
bug-fixes, improvements, and updates to the original FreeBSD sources
since it was ported to NetBSD, including this interesting one with a
possible fix and explanation for lost interrupts which would explain why
the I/O requests pending at the time of the hang never complete until
another I/O is started.
Unfortunately this fix was done after the driver's interrupt handling
was threaded:
----------------------------
revision 1.46
date: 2003/01/11 01:59:21; author: scottl; state: Exp; lines: +50 -39
Major bugfixes for large memory and fast systems.
aac.c:
Re-arrange the interrupt handler to optimize the common case of
the adapter interrupting us because one or more commands are complete,
and do a read across the pci bus to ensure that all posted status
writes are flushed. This should close a race that could cause command
completion interrupts to be lost.
Follow the spec a bit closer when filling out command structures.
Enable the Fast Response feature to eliminate the need for the card
to DMA successfull command completions back into host memory.
Tell the controller how much physical memory we have. Without this
there was a chance that our DMA regions would collide with the
memory window used by the cache on the controller. The result would
be massive data corruption. This seemed to mainly affect systems with
>2GB of memory.
Fix a few whitespace problems.
aac_debug.c:
Add an extra diagnostic when printing out commands.
aac_disk.c:
Add extra sanity checks.
aacreg.h:
Prepare for making this 64-bit clean by reducing the use of enumeration
types in structures.
Many thanks to Justin Gibbs for helping track these down.
----------------------------
>How-To-Repeat:
>Fix:
Re-porting the driver from FreeBSD seems quite daunting, given the
major divergences growing between FreeBSD and NetBSD, particularly in
storage device drivers.
However much/all of the good work put into the FreeBSD driver has been
picked up by OpenBSD in their re-write of the aac(4) driver. It would
seem on first glance to be much easier to port the new OpenBSD driver
back over to NetBSD. Unfortunately I don't have time to do this all
on my own at the moment, though I would be able to help test, at least
on the netbsd-4 branch (though if testing only requires booting a
-current kernel on a netbsd-4 userland, I can do that too). I have
several HP NetRAID cards as well as this Dell PE2650 with the embedded
PERC controller, and if necessary I could swap other HP cards in for
testing too.
At minimum it sure would be nice if someone could figure some way to
fix whatever seems to be causing interrupts to be lost.
However unless the minimum fix is terribly small and easy I think the
additional benefits that could be had by updating the driver would be
far more desirable. Perhaps a newer driver will even make my other
problem (always reading zeros from the second controller) go away too,
though as yet I've not had a chance to test this with an OpenBSD or
FreeBSD boot.