Subject: fxp1: device timeout and panic: pool_get(%s): free list modified
To: None <port-alpha@netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: port-alpha
Date: 06/05/2000 23:51:29
[I thought I sent something like this a day or two ago, but I can't 
find my copy.]

First, the panic.

I just got a second one: pool_get(%s): free list modified: magic=%x; 
page %p; item addr %p. 

I'm running network tests on a point-to-point link between a pair of 
82558s, using the fxp driver.  This is on Alphas running 1.4Z.  (Miata, 
600au, in case that matters.) 

I haven't seen any troubles while running the same tests on a pair 
of 400 MHz Celerons running 1.4Z.

At the time this happened, I was running an "easy" test.  It keeps 
the link very busy with traffic in both directions, but I call it 
easy because it doesn't provoke any buffer overflows or exercise 
any other uncommon code paths. 

I'm running a request-response pattern test with 3 messages in flight 
to keep everything busy.  When things go right, this test will get 
95 megabits in each direction.  The case that crashed was using 17952 
byte messages over UDP. 

So there will never be more than 3*17952 bytes on any queue.  Rounding 
up for headers, that's 13 packets per message or 39 packets total.  
That shouldn't be a big deal. 
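
For anyone who wants to check the packet count, here is a minimal sketch 
of the arithmetic.  It assumes a standard 1500 byte Ethernet MTU, a 20 
byte IPv4 header with no options, and an 8 byte UDP header.  None of 
those are stated above, and the function name is made up for 
illustration.

```python
import math

# Assumptions (not stated in the message): Ethernet MTU of 1500 bytes,
# a 20-byte IPv4 header with no options, and an 8-byte UDP header.
MTU = 1500
IP_HEADER = 20
UDP_HEADER = 8


def fragments_per_message(payload_bytes, mtu=MTU):
    """Number of IP fragments needed to carry one UDP datagram."""
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    # The UDP header travels in the first fragment, so it counts as data.
    return math.ceil((payload_bytes + UDP_HEADER) / per_fragment)


per_message = fragments_per_message(17952)
in_flight = 3 * per_message  # 3 messages outstanding at once
print(per_message, in_flight)  # prints: 13 39
```

Under those assumptions the numbers agree with the 13 and 39 above.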

 
The previous time it crashed I was running a UDP blast-em test on 
the same hardware setup.  That does provoke buffer overflows.  This 
time, I had run a blast-em test, but that was a long time ago - close 
to an hour. 

I've got both dumps.  If anybody wants some info from them, tell 
me what to type.


Now for the timeout.  This seems suspicious.  It might be related. 

From the log file:

Jun  5 03:29:06 mckinley /netbsd: fxp1: device timeout
Jun  5 03:29:40 mckinley last message repeated 3 times
Jun  5 03:31:46 mckinley last message repeated 11 times
Jun  5 03:41:49 mckinley last message repeated 47 times
.....

I've looked at the code several times.  It all looks OK to me.  It 
works on i386, at least so far. (I'll go hack the printf to provide 
more info.) 

Maybe interesting data... 

Jun  5 23:17:26 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=3
Jun  5 23:17:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=6
Jun  5 23:22:21 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun  5 23:23:05 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun  5 23:24:49 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=20
Jun  5 23:24:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=29

I was running a UDP test at the time.  Some of the timeouts didn't 
lose any data!  I think that means all the packets have been transmitted.  
The problem is that they aren't getting cleaned up.
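
To make the pattern in those log lines easier to see, here is a small 
sketch that pulls the two counters out of each line.  The regular 
expression is just matched to the message format shown above, and the 
helper name is made up for illustration.

```python
import re

# The six log lines quoted above, verbatim.
LOG = """\
Jun  5 23:17:26 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=3
Jun  5 23:17:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=6
Jun  5 23:22:21 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun  5 23:23:05 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=17
Jun  5 23:24:49 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=20
Jun  5 23:24:55 foraker /netbsd: fxp1: device timeout: txpending=128, snd.ifq_len=29
"""

PATTERN = re.compile(r"txpending=(\d+), snd\.ifq_len=(\d+)")


def counters(log_text):
    """Return a list of (txpending, ifq_len) pairs, one per matching line."""
    return [(int(m.group(1)), int(m.group(2))) for m in PATTERN.finditer(log_text)]


samples = counters(LOG)
txpending_values = {tx for tx, _ in samples}
queue_lengths = [q for _, q in samples]

print(txpending_values)  # prints: {128}
print(queue_lengths)     # prints: [3, 6, 17, 17, 20, 29]
```

In these samples txpending is 128 at every timeout, while the length of 
the send queue behind it varies from 3 to 29.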




These machines have a quad 82558 card, a pair of Tulips, an FDDI 
card, and an Alteon Gigabit card, so the crashes could be caused 
by another driver and just provoked by the fxp driver.  I can comment 
the other drivers out of the config if anybody is suspicious. 

But that doesn't explain the timeouts.