Subject: Re: kern/31455: ex (905[BC]) cards can hang in -current kernels
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Robert Elz <kre@munnari.OZ.AU>
List: netbsd-bugs
Date: 10/04/2005 16:15:03
The following reply was made to PR kern/31455; it has been noted by GNATS.
From: Robert Elz <kre@munnari.OZ.AU>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: kern/31455: ex (905[BC]) cards can hang in -current kernels
Date: Tue, 04 Oct 2005 23:05:14 +0700
Christos Zoulas suggested a patch something like the one appended.
With this change in place, the problem has gone, and I can no longer
cause the problem to occur.
I am a little concerned about what this would mean to a big-endian system
with a PCI pus and an ex card in it, but I will leave that for someone
who understands PCI on big-endian systems to contemplate.
I have also caused the "excess collision" case (which is what was
happening to me, and probably the only realistic way to force this)
to count as an output error (which it is, on every ethernet chip I've
ever programmed, and I assume it is here as well) - that enabled me to
verify that the condition was still occurring (the test case had not
altered) but is now being correctly handled. I'd suggest that that
part of the patch be retained (but I don't think counting excess collisions
as a single collision makes any sense at all - either don't count, or
count as 16 (or 14 if the first two are already accounted for) or something.
But not 1 collision...
While here two other things in elinkxl.c for someone to consider ...
First, the bus_space_read_1() in ex_probemedia() that reads
ELINK_W3_RESET_OPTIONS is (according to elin3kreg.h) really reading a
16 bit register - and is putting the result in a u_int16_t type variable.
This seems harmless, only the low 8 bits are ever used in elinkxl.c.
And last, the manipulation of the tx_succ_ok variable (field) looks
bogus to be - doing a test like
if (sc->tx_succ_ok < 100)
along with code that increments the value like
sc->tx_succ_ok = (sc->tx_succ_ok+1) & 127;
makes no sense at - the test cannot tell the difference between 7 and
135 increments, yet that is clearly what it is wanting to do.
The increment should probably be pegging the value at 127 (or anything
bigger than 100) once it reaches that high, as < 100 is all that
seems to matter, once it is bigger, how much bigger is irrelevant,
until the value is reset to 0 and it all starts again.
I'll leave that one for someone who understands what is really happening
to ponder.
kre
ps: the trivial patch is appended - please think about what is really
happening here, don't just blindly apply it.
Index: elinkxl.c
===================================================================
RCS file: /cvsroot/NetBSD/src/sys/dev/ic/elinkxl.c,v
retrieving revision 1.83
diff -u -r1.83 elinkxl.c
--- elinkxl.c 31 May 2005 01:48:22 -0000 1.83
+++ elinkxl.c 4 Oct 2005 15:48:01 -0000
@@ -789,9 +789,10 @@
/*
* We need to read+write TX_STATUS until we get a 0 status
* in order to turn off the interrupt flag.
+ * ELINK_TXSTATUS is in the upper byte of 2 with ELINK_TIMER
*/
- while ((i = bus_space_read_1(iot, ioh, ELINK_TXSTATUS)) & TXS_COMPLETE) {
- bus_space_write_1(iot, ioh, ELINK_TXSTATUS, 0x0);
+ while ((i = bus_space_read_2(iot, ioh, ELINK_TIMER)) & TXS_COMPLETE) {
+ bus_space_write_2(iot, ioh, ELINK_TIMER, 0x0);
if (i & TXS_JABBER) {
++sc->sc_ethercom.ec_if.if_oerrors;
@@ -813,6 +814,7 @@
ex_init(ifp);
/* TODO: be more subtle here */
} else if (i & TXS_MAX_COLLISION) {
+ ++sc->sc_ethercom.ec_if.if_oerrors;
++sc->sc_ethercom.ec_if.if_collisions;
bus_space_write_2(iot, ioh, ELINK_COMMAND, TX_ENABLE);
sc->sc_ethercom.ec_if.if_flags &= ~IFF_OACTIVE;