Subject: Re: Massive interrupt problems on Tyan 2882-D
To: None <port-amd64@NetBSD.org>
From: =?ISO-8859-1?Q?Edgar_Fu=DF?= <ef@math.uni-bonn.de>
List: port-i386
Date: 03/22/2007 17:41:48
Since the problem is probably not amd64-specific and there is no
port-x86 list, I am cc'ing port-i386, although I do not read that list.
As the subject says, I've got interrupt problems on a 2882-D (one
single-core CPU fitted, non-SMP kernel) where drivers suddenly seem
to stop receiving interrupts.
Specifically, ahd(4) complains about timed out SCBs being already
complete and bge(4) about blocks that do not stop.
The machine runs fine for hours (under NFS load) and then suddenly
almost locks up (every disk I/O takes minutes to complete).
The more I think about it, the weirder it gets.
We have:
-- ahd0 and bge0 sharing ioapic1 pin0, thus irq5.
-- ahd1 and bge1 sharing ioapic1 pin1, thus irq10.
Given that, according to Tyan's block diagrams, all of these devices
are on the same PCI bus (Bus A of the 8131), it seems plausible that
they actually share two PCI interrupts and thus two IOAPIC pins.
Also, if I switch to a non-ioapic kernel, dmesg keeps reporting them
as using irq5 and irq10.
-- A 36G RAID1 on ahd0 containing the OS.
-- A 928G RAID5 on ahd1 holding user data (not really, at the moment).
-- Active traffic on bge0.
-- Nothing on bge1: the interface is down.
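To picture the interrupt sharing described above, here is a minimal
Python sketch of a generic level-triggered shared line, where every
handler on the line is called in turn and claims the interrupt only
if its own device has one pending. This is a simplified model, not
the NetBSD implementation; the names Device, SharedLine, intr and
dispatch are made up for illustration.

```python
class Device:
    """Hypothetical device model: one pending flag and a service counter."""

    def __init__(self, name):
        self.name = name
        self.pending = False   # True while the device asserts its interrupt
        self.serviced = 0      # number of interrupts this device has handled

    def intr(self):
        """Handler: return True if this device claimed the interrupt."""
        if not self.pending:
            return False       # not ours; another device on the line raised it
        self.pending = False   # acknowledge the device
        self.serviced += 1
        return True


class SharedLine:
    """Hypothetical shared interrupt line with several devices attached."""

    def __init__(self, irq, devices):
        self.irq = irq
        self.devices = devices

    def asserted(self):
        # A level-triggered line stays asserted while any device is pending.
        return any(d.pending for d in self.devices)

    def dispatch(self):
        # Call every handler on the line; collect the names that claimed it.
        return [d.name for d in self.devices if d.intr()]


if __name__ == "__main__":
    ahd0, bge0 = Device("ahd0"), Device("bge0")
    irq5 = SharedLine(5, [ahd0, bge0])
    bge0.pending = True
    print(irq5.dispatch(), irq5.asserted())
```

In this model, a device whose handler never gets called (or never
clears its pending state) keeps the line asserted for everybody
sharing it, which is why sharing is a natural first suspect.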
The machine ran without problems, even with heavy I/O on ahd1
(rebuilding RAID parity).
I had two situations where problems arose, both involving heavy usage
as an NFS server. Unfortunately, that's exactly what the machine is
supposed to be used for.
Both times, I had a Linux NFS client writing large amounts of data
to the raid on ahd1. Both times, that data came in on bge0.
The first time, I got errors on ahd1 (since I didn't use bge1 at that
time, I know nothing about that driver). But I got no errors on bge0
or ahd0. So this looks like a problem on irq10.
The second time, I got errors on both ahd0 and bge0, while ahd1 worked.
If this were an interrupt sharing issue, why would I get problems in
case one? There was no activity on bge1, which shares the interrupt
with ahd1.
If it were an issue in ahd(4), why doesn't it show up when rebuilding
RAID parity?
It might be some issue in bge(4), but why did that affect ahd1 while
leaving bge0 unaffected in case one?
The problem seems to be triggered by simultaneously high network and
SCSI traffic. But in case two, the net traffic involved irq5 while
the SCSI traffic involved irq10.
I once thought it might be some spl confusion in bge(4), but I think
I would have had even more fun if it were. Also, this would have
affected bge0 in case one.
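To make the two incidents easier to compare, here is a small Python
sketch that tabulates them by IRQ line. The device-to-IRQ mapping and
the observations are the ones given in this mail; the names IRQ_OF,
CASES, irqs and summarize are only illustrative.

```python
# Device-to-IRQ mapping as reported above
# (ioapic1 pin0 -> irq5, ioapic1 pin1 -> irq10).
IRQ_OF = {"ahd0": 5, "bge0": 5, "ahd1": 10, "bge1": 10}

# Observations from the two incidents. Devices not listed in a case
# were not reported on (bge1 was unused or down).
CASES = {
    "case one": {"failed": {"ahd1"}, "ok": {"ahd0", "bge0"}},
    "case two": {"failed": {"ahd0", "bge0"}, "ok": {"ahd1"}},
}


def irqs(devices):
    """Return the sorted list of IRQ lines used by the given devices."""
    return sorted({IRQ_OF[d] for d in devices})


def summarize(cases=CASES):
    """Map each case to the IRQ lines that failed and those that worked."""
    return {
        name: {"failed_irqs": irqs(obs["failed"]), "ok_irqs": irqs(obs["ok"])}
        for name, obs in cases.items()
    }


if __name__ == "__main__":
    for name, result in summarize().items():
        print(name, result)
```

In both cases the failure stays on a single IRQ line, but it is a
different line each time, although the load (network on bge0, SCSI on
ahd1) was the same.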
Any ideas, anyone? I've got four to five identical machines to test
all sorts of things on, albeit only one storage box with really large
amounts of disk space.