Subject: fxp1: device timeout and panic: pool_get(%s): free list modified
To: None <port-alpha@netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: port-alpha
Date: 06/05/2000 23:51:29
[I thought I sent something like this a day or two ago, but I can't
find my copy.]
First, the panic.
I just got a second one: pool_get(%s): free list modified: magic=%x;
page %p; item addr %p.
I'm running network tests on a point-to-point link between a pair of
82558s - the fxp driver. This is on Alphas running 1.4Z. (Miata,
600au in case that matters.)
I haven't seen any troubles while running the same tests on a pair
of 400 MHz Celerons running 1.4Z.
At the time this happened, I was running an "easy" test. It keeps
the link very busy with traffic in both directions, but I call it
easy because it doesn't provoke any buffer overflows or exercise
any other uncommon code paths.
I'm running a request-response pattern test with 3 messages in flight
to keep everything busy. When things go right, this test will get
95 megabits in each direction. The case that crashed was using 17952
byte messages over UDP.
So there will never be more than 3*17952 bytes on any queue. Rounding
up for headers, that's 13 packets per message or 39 packets total.
That shouldn't be a big deal.
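For anyone who wants to check the packet count, here is a rough sketch of the arithmetic. It assumes a standard 1500-byte Ethernet MTU, a 20-byte IPv4 header with no options, and an 8-byte UDP header; none of those numbers come from the test setup itself, and the function name is made up for illustration.

```python
import math

MTU = 1500        # assumed Ethernet MTU, in bytes
IP_HEADER = 20    # assumed IPv4 header with no options
UDP_HEADER = 8    # UDP header, carried once per datagram


def fragments_per_message(payload_bytes, mtu=MTU):
    """Number of IP fragments needed for one UDP message of payload_bytes."""
    # Data carried by each fragment; all but the last must be a multiple of 8.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((payload_bytes + UDP_HEADER) / per_fragment)


per_message = fragments_per_message(17952)
in_flight = 3 * per_message
print(per_message, in_flight)
```

Under those assumptions a 17952-byte message needs 13 fragments, so 3 messages in flight is at most 39 packets.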
The previous time it crashed I was running a UDP blast-em test on
the same hardware setup. That does provoke buffer overflows. This
time, I had run a blast-em test, but that was a long time ago - close
to an hour.
I've got both dumps. If anybody wants some info from them, tell
me what to type.
Now for the timeout. This seems suspicious. It might be related.
From the log file:
Jun 5 03:29:06 mckinley /netbsd: fxp1: device timeout
Jun 5 03:29:40 mckinley last message repeated 3 times
Jun 5 03:31:46 mckinley last message repeated 11 times
Jun 5 03:41:49 mckinley last message repeated 47 times
.....
I've looked at the code several times. It all looks OK to me. It
works on i386, at least so far. (I'll go hack the printf to provide
more info.)
Maybe interesting data...
Jun 5 23:17:26 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=3
Jun 5 23:17:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=6
Jun 5 23:22:21 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun 5 23:23:05 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun 5 23:24:49 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=20
Jun 5 23:24:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=29
I was running a UDP test at the time. Some of the timeouts didn't
lose any data! I think that means all the packets have been transmitted.
The problem is that they aren't getting cleaned up.
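A quick way to see the pattern in the log lines above is to pull out the two counters. This is only a sketch against the six lines quoted in this message; the regular expression and the helper name are mine, not anything from the driver.

```python
import re

# The six log lines quoted above, copied verbatim.
LOG = """\
Jun 5 23:17:26 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=3
Jun 5 23:17:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=6
Jun 5 23:22:21 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun 5 23:23:05 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun 5 23:24:49 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=20
Jun 5 23:24:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=29
"""

PATTERN = re.compile(r"txpending=(\d+), snd\.ifq_len=(\d+)")


def parse_timeouts(log):
    """Return a list of (txpending, ifq_len) pairs, one per timeout line."""
    return [(int(m.group(1)), int(m.group(2))) for m in PATTERN.finditer(log)]


samples = parse_timeouts(LOG)
txpending_values = {tx for tx, _ in samples}
queue_lengths = [q for _, q in samples]
print(txpending_values)
print(queue_lengths)
```

txpending is 128 at every timeout, while snd.ifq_len never goes down across these samples.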
These machines have a quad 82558 card, a pair of Tulips, an FDDI
card, and an Alteon Gigabit card, so the crashes could be caused
by another driver and just provoked by the fxp driver. I can comment
them out of the config if anybody is suspicious.
But that doesn't explain the timeouts.