NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/41579: siista driver interrupt and command completion handling broken
>Number: 41579
>Category: kern
>Synopsis: siista driver interrupt and command completion handling broken
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Jun 12 10:45:00 +0000 2009
>Originator: Wolfgang Stukenbrock
>Release: NetBSD 5.0
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s012 4.0 NetBSD 4.0 (NSW-S012) #9: Fri Mar 13 12:31:52 CET 2009
wgstuken@s012:/usr/src/sys/arch/amd64/compile/NSW-S012 amd64
Architecture: x86_64
Machine: amd64
>Description:
Our systems with io-cards that uses the siisata driver has locked up
after a short time all the time.
I've analysed the problem and located two errors in the driver.
1. if a command completes very fast after it has been started while the
interrupt handling is still running for that port, the driver
clears any pending interrupt on interrupt routine exit for the port.
This may clear the next pending interrupt, if it is already present.
2. if a bio-command runs into timeout, the completion routine does not
check for the timeout condition and the failed transfer never gets
to the ata-routines. (the code in the atapi-completion routine is
broken too.)
The cause for the first problem is a general error in the interrupt
code. You may never get some interrupt conditions, process them and
clear it (or all interrupts as done by the driver) later.
The driver must clear the interrupt conditions that it is gooing to
process immidiatly after determining the set of conditions and before it
starts processing of the conditions. Otherwise there is a risk of
loosing some conditions.
The patch below will correct this behavior. The automatic clear of the
completion interrupt by reading the slot-status is disabled
for security reasons by the patch. It makes no longer sence to use it
anymore and makes debugging of the driver very hard ...
The cause for the second problem is just a missing check in the routine
for the timeout condition.
The patch below will correct this in siisata_bio_complete().
The siisata_atapi_complete() routine has been corrected too. But the
patch will not change the processing order (timeout versus interrupt),
because I'm not shure if this is just another error in the driver or if
there is a special reason why the interrupts for atapi devices are
not cleared if a command has timed out. Someone else should have a look
at this, because I think the current version of the atapi completion
routine is incorrect! At least it is inconsistent to the command and
bio completion routines.
While debugging the driver I've seen another programming error, that
will not influence the operation as long as there is only one CPU
nodifiying the status register at the time. Ths Stealvine chip has a
Bit-set (PCS) and Bit-clear (PCC) semantic for the port configuration register.
So it makes no sence - or may lead to unexpected results if another CPU
is just clearing a bit - to read the old status, add some bits and
then write the result into the bit-Set-register.
The way that should be used is just to write the bits that should be
set now to the PCS register (or PCC when clearing a bit).
There is an own definition for the read-mode of the port status "PS"in
the headerfile. That should be used in all read operations and not
the PCS definition.
The patch below will fix this too.
With this patch, the drivers runs stable on our systems without locking
up anymore.
remark:
We have only ata devices connected - so the atapi code does not run and
we have never seen any real error of the drives up to now.
The chip-manual accedently says nothing about the correct order reading
the cause of the error and clearing the error-interrupt.
And this patch will change the order from "read error then clear
interrupt" to "clear interrupt prior read error status".
As long as only one slot will ever used by the driver - this means only
one command is on the chip - this should be no problem at all.
Due to the fact that the chip stops processing after the first error
and all pending commands need to be reissued again, the order should
be not relevant. But I can be wrong here ..
If desired (or required) it would be easy to delay clearing the
interrupt condition until the error status has been read.
(attention: the port-slot-status may not be read prior clearing the
completion interrupt! Otherwise there will be the next whole in
interrupt processing and some interrupts may be lost ... So the
command-complete interrupt must be cleared prior processing!)
>How-To-Repeat:
Setup a system with e.g. an SIL3124 controller and start IO in it.
After some time you will loose the contact to a disk due to a missed
interrupt and the broken timeout handling in siisata_bio_complete().
>Fix:
The following patch to /usr/src/sys/dev/ic/siisata.c will fix the
problems above. (output of "diff -c")
*** siisata.c-orig Wed Jun 10 15:50:37 2009
--- siisata.c Fri Jun 12 10:20:19 2009
***************
*** 233,238 ****
--- 233,240 ----
/* come out of reset, 64-bit activation */
PRWRITE(sc, PRX(chp->ch_channel, PRO_PCC),
PR_PC_32BA | PR_PC_PORT_RESET);
+ /* disable clear interrupt on PSS read */
+ PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_INCOR);
/* initialize port */
siisata_reinit_port(chp);
/* clear any interrupts */
***************
*** 393,398 ****
--- 395,410 ----
for (i = 0; i < sc->sc_atac.atac_nchannels; i++)
if (is & GR_GIS_PXIS(i))
siisata_intr_port(sc, &sc->sc_channels[i]);
+ #ifdef DIAGNOSTIC
+ is &= ~GR_GIS_PXIS_MASK;
+ if (is) { /* reset any unexpected interrupts to avoid busy
loops ... */
+ log(LOG_WARNING, "%s: resetting unexpected interrupt
(GIS 0x%x)\n", SIISATANAME(sc), is);
+ GRWRITE(sc, GR_GIS, is);
+ }
+ #else
+ if (is & ~GR_GIS_PXIS_MASK)
+ panic("%s: %s: unexpected interrupt (GIS 0x%x)",
SIISATANAME(sc), __func__, is);
+ #endif
}
return r;
}
***************
*** 407,421 ****
SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
if ((xfer != NULL) && (xfer->c_intr != NULL))
xfer->c_intr(chp, xfer, slot);
#ifdef DIAGNOSTIC
- else
log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
#endif
!
! /* clear some (ok, all) ints */
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
return;
}
--- 419,434 ----
SIISATA_DEBUG_PRINT(("%s: %s port %d\n",
SIISATANAME(sc), __func__, chp->ch_channel), DEBUG_INTR);
+ /* remark: the called routine must clear all processed interrupt itself -
otherwise we may loose some interrupts ... */
if ((xfer != NULL) && (xfer->c_intr != NULL))
xfer->c_intr(chp, xfer, slot);
+ else {
#ifdef DIAGNOSTIC
log(LOG_WARNING, "%s: unable to handle interrupt\n", __func__);
#endif
! /* clear some (ok, all) ints */
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), 0xffffffff);
! }
return;
}
***************
*** 430,436 ****
int slot = SIISATA_NON_NCQ_SLOT;
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
prb = schp->sch_prb[slot];
--- 443,449 ----
int slot = SIISATA_NON_NCQ_SLOT;
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
prb = schp->sch_prb[slot];
***************
*** 490,496 ****
SIISATANAME(sc), chp->ch_channel);
/* XXX and then ? */
}
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
--- 503,509 ----
SIISATANAME(sc), chp->ch_channel);
/* XXX and then ? */
}
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
PRWRITE(sc, PRX(chp->ch_channel, PRO_SERROR),
PRREAD(sc, PRX(chp->ch_channel, PRO_SERROR)));
***************
*** 546,552 ****
schp->sch_sstatus)) {
case SStatus_DET_DEV:
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS))
& PR_PS_PORT_READY))
DELAY(10);
--- 559,565 ----
schp->sch_sstatus)) {
case SStatus_DET_DEV:
/* wait for ready */
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS))
& PR_PS_PORT_READY))
DELAY(10);
***************
*** 781,792 ****
("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if ((xfer->c_flags & C_TIMEOU) != 0)
goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 794,807 ----
("%s: %s\n", SIISATANAME(sc), __func__), DEBUG_FUNCS);
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
if ((xfer->c_flags & C_TIMEOU) != 0)
goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set
... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1062,1070 ****
uint32_t *prbfis = (void *)fis;
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1077,1090 ----
uint32_t *prbfis = (void *)fis;
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
+
+ if ((xfer->c_flags & C_TIMEOU) != 0)
+ goto command_done;
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set
... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
***************
*** 1116,1122 ****
command_done:
schp->sch_active_slots &= ~__BIT(slot);
chp->ch_flags &= ~ATACH_IRQ_WAIT;
! callout_stop(&chp->ch_callout);
chp->ch_queue->active_xfer = NULL;
--- 1136,1147 ----
command_done:
schp->sch_active_slots &= ~__BIT(slot);
chp->ch_flags &= ~ATACH_IRQ_WAIT;
! if (xfer->c_flags & C_TIMEOU) {
! ata_bio->error = TIMEOUT;
! } else {
! callout_stop(&chp->ch_callout);
! ata_bio->error = NOERROR;
! }
chp->ch_queue->active_xfer = NULL;
***************
*** 1270,1278 ****
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_PORT_INITIALIZE);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
}
--- 1295,1302 ----
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_PORT_INITIALIZE);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
}
***************
*** 1281,1289 ****
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) | PR_PC_DEVICE_RESET);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PCS)) & PR_PS_PORT_READY))
DELAY(10);
}
--- 1305,1312 ----
{
struct siisata_softc *sc = (struct siisata_softc *)chp->ch_atac;
! PRWRITE(sc, PRX(chp->ch_channel, PRO_PCS), PR_PC_DEVICE_RESET);
! while (!(PRREAD(sc, PRX(chp->ch_channel, PRO_PS)) & PR_PS_PORT_READY))
DELAY(10);
}
***************
*** 1456,1464 ****
periph->periph_cap |= PERIPH_CAP_CMD16;
/* configure port for packet length */
! PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS),
! PRREAD(siic, PRX(chp->ch_channel, PRO_PCS)) |
! PR_PC_PACKET_LENGTH);
}
/* XXX This is gross. */
periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
--- 1479,1485 ----
periph->periph_cap |= PERIPH_CAP_CMD16;
/* configure port for packet length */
! PRWRITE(siic, PRX(chp->ch_channel, PRO_PCS),
PR_PC_PACKET_LENGTH);
}
/* XXX This is gross. */
periph->periph_cap |= (id->atap_config & ATAPI_CFG_DRQ_MASK);
***************
*** 1649,1657 ****
}
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status, clearing completion interrupt */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
--- 1670,1680 ----
}
pis = PRREAD(sc, PRX(chp->ch_channel, PRO_PIS));
+ /* clear only those interrupts, we are gooing to server now ... */
+ if (pis != 0) PRWRITE(sc, PRX(chp->ch_channel, PRO_PIS), pis);
if (pis & PR_PIS_CMDCMPL) {
! /* get slot status - interrupt cleared above - PR_PC_INCOR set
... */
pss = PRREAD(sc, PRX(chp->ch_channel, PRO_PSS));
/* is this expected? */
if ((schp->sch_active_slots & __BIT(slot)) == 0) {
>Unformatted:
Home |
Main Index |
Thread Index |
Old Index