Subject: kern/26568: Yesterday's "pciide" `irqack fix' breaks Promise 202xx controllers
To: None <gnats-bugs@gnats.NetBSD.org>
From: None <paul@Plectere.com>
List: netbsd-bugs
Date: 08/06/2004 02:24:43
>Number: 26568
>Category: kern
>Synopsis: an occasional "pdcide0:0 bogus intr (reg 0x1xxxxxxxx)" is fatal
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Aug 06 09:28:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator: Paul Shupak
>Release: NetBSD 2.0G
>Organization:
>Environment:
System: NetBSD svcs 2.0G NetBSD 2.0G (SVCS) #262: Fri Aug 6 01:17:44 PDT 2004 root@svcs:/sys/arch/i386/compile/SVCS i386
Architecture: i386
Machine: i386
>Description:
For several years I have seen occasional spurious interrupts (a few a
day, under heavy load) on Promise FastTrack-66s used as part of RAID
arrays. With yesterday's changes, the "bogus" interrupt repeats
indefinitely instead of stopping after a single instance (i.e. the
machine keeps printing the message until either the reset or the power
button is pressed). This has never been a problem on these controllers
before. I have two, each part of a RAID array in a different machine.
Both chipsets give "bogus" interrupts for their PDCs, but the Via
(VT82C691) gives about six times as many as the Intel (i810). Neither
card shares its PCI interrupt with any other device in either machine
(one has IRQ 5 dedicated, the other IRQ 11).
Reverting pdcide.c from revision 1.12 to 1.11 solves the problem for
me (i.e. it goes back to the occasional single "bogus" interrupt: about
15-25 a day on the Via machine and about 2-4 a day on the i810).
>How-To-Repeat:
Boot a machine with a FastTrack-66, then cause heavy disk activity (a
RAID5 parity rebuild works just fine). Watch the endless loop of
"bogus" messages.
>Fix:
Revert the change for the Promise controllers? Or perhaps test the wdc
state further, beyond the simple "wdcintr(wdc_cp)" check? Either way,
please do not write IDEDMA_CTL during the interrupt without
acknowledging the interrupt to the hardware (i.e. the EOI dance on
x86). If a DMA really is pending, the WDC `cause' has now been cleared,
so we can enter the infinite loop described above: when the outstanding
DMA completes, the outstanding request causes another "bogus"
interrupt, and so on. The alternative is that we lose the outstanding
transaction. Neither outcome is acceptable. Another option is to return
non-zero and do the EOI dance to prevent redelivery of the same
interrupt. (Note: the Promise case returns zero; if we are eating the
interrupt, we probably should return one, i.e. "rv = 1;". I didn't test
this, but it seems simple enough that it might work.)
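To make the last suggestion concrete, below is a toy model in C of the
redelivery hypothesis. It is NOT pdcide.c and NOT NetBSD's x86 interrupt
code; every name in it (toy_ctlr, toy_intr_rv0, toy_intr_rv1,
toy_dispatch) is hypothetical. It assumes a level-triggered line that is
acknowledged only when a handler claims the interrupt by returning
non-zero. Under that assumption, a handler that returns 0 for a "bogus"
interrupt is redelivered until the delivery cap, while the untested
"rv = 1;" variant is delivered once. It does not model a DMA that is
still in flight, so treat it as a sketch of the idea, not a verified fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model only; all names are hypothetical, not NetBSD code.
 * ASSUMPTION: the line stays asserted until a handler claims the
 * interrupt (returns non-zero), at which point the dispatcher
 * acknowledges it (the "EOI dance").
 */
struct toy_ctlr {
	bool intr_asserted;	/* models the pending interrupt on the line */
	bool wdc_cause;		/* models a genuine WDC (disk) completion */
};

/* Variant returning 0 for a "bogus" interrupt (never claimed). */
static int
toy_intr_rv0(struct toy_ctlr *c)
{
	if (!c->wdc_cause) {
		printf("toy0:0 bogus intr\n");
		return 0;	/* unclaimed: nothing acknowledges the source */
	}
	c->wdc_cause = false;	/* genuine completion handled */
	return 1;
}

/* Variant with the reporter's untested "rv = 1;" idea. */
static int
toy_intr_rv1(struct toy_ctlr *c)
{
	if (!c->wdc_cause)
		printf("toy0:0 bogus intr\n");
	c->wdc_cause = false;
	return 1;		/* claim it so the dispatcher acknowledges it */
}

/*
 * Deliver the interrupt while the line is asserted, up to
 * max_deliveries times; returns the number of deliveries made.
 */
static int
toy_dispatch(struct toy_ctlr *c, int (*handler)(struct toy_ctlr *),
    int max_deliveries)
{
	int deliveries = 0;

	assert(max_deliveries > 0);
	while (c->intr_asserted && deliveries < max_deliveries) {
		deliveries++;
		if (handler(c))
			c->intr_asserted = false;	/* claimed: acknowledge */
	}
	return deliveries;
}
```

For example, with intr_asserted set and wdc_cause clear,
toy_dispatch(&c, toy_intr_rv0, 5) returns 5 and leaves the line
asserted (the endless "bogus" loop, capped), while
toy_dispatch(&c, toy_intr_rv1, 5) returns 1.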
>Release-Note:
>Audit-Trail:
>Unformatted: