Port-macppc archive


Re: lockups on 6.0.2 - progress?



My environment:

PowerMac G4, 896 MBytes mem.  2 ATA disks on the internal bus/ribbon.
NetBSD 6.0.2.  Standard kernel.
Machine name: charm.icompute.com charm

I changed my test case so that the wget script runs on charm.  I disabled
the network interfaces and set up apache to listen on localhost.

The script looks like this:

---
#!/bin/ksh

set -e

while true ; do
        date
        wget -t 1 -T 8 -q -a logfile -O index.html http://127.0.0.1
        echo -n  index
        wget -t 1 -T 8 -q -a logfile -O text.txt http://127.0.0.1/text.txt
        echo text
done
---

When I run it, the machine hangs within a few hours.  It's not the hard hang
I get when I run the script on another machine, but it is a hang.
With the network disabled, I can't ping it, or ssh/telnet in and watch it from
multiple windows.  All I know is that ctrl-c does not bring back a shell prompt,
and the script does not fail (time out); it just stops producing output.

Unlike when the script runs on another machine, the keyboard does
echo chars to the screen, but that's it.  (I think only return chars are
echoed.... have to check next time it fails.)

I've tried leaving different things running while the test is active,
and writing output to a file.  tail -f, top, systat - all behave the same.
When the hang comes, there is no response and no new shell prompt.
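
One way to pin down when the hang starts, given that nothing on screen
updates afterwards, is a heartbeat that appends a timestamp to a file and
syncs it to disk.  What follows is only a minimal sketch, not something that
was run as part of this test: the function name "heartbeat", its arguments,
and the log path are hypothetical, and it assumes disk writes keep working
until the moment of the hang.

```shell
#!/bin/sh
# heartbeat LOGFILE INTERVAL COUNT  (hypothetical helper, not part of the original test)
# Appends one timestamp per INTERVAL seconds to LOGFILE and flushes it to disk.
# COUNT is the number of beats to write; 0 means run until killed.
heartbeat() {
        logfile=$1
        interval=$2
        count=$3
        n=0
        while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
                date '+%Y-%m-%d %H:%M:%S' >> "$logfile"
                sync            # push the line to disk so it survives a reset
                n=$((n + 1))
                sleep "$interval"
        done
}
```

Started in the background before the wget loop, for example as
heartbeat /var/tmp/heartbeat.log 1 0 &, the last line in the file after a
reboot gives the approximate time of the hang.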

Bottom line
-=--=-=-=-=-=

I've eliminated the network cards.  *IF* this is the same problem, it
looks like it's not in the drivers.

-dgl-

>Hi,
>
>I reported problems with gem(4) on macppc as a bug (kern/46083).
>As the system board is now broken, I can no longer test myself
>(or confirm that it's something related to the driver or an
>already broken board;  at least it was running with NetBSD 5.x
>and Linux without problems while on -6 I had an unstable gem(4)).
>
>Maybe both problems are related, even though I didn't see much
>output.  Running makemandb over nfs was enough to break gem(4)
>connection.  If you think it may be related, it might help
>to combine both bug-reports in a single PR (or if you haven't
>added any PR so far, add your experiences to the kern/46083).
>
>--
>Regards
>Matthias Kretschmer
>
>
>On Fri, May 31, 2013 at 09:03:42PM -0500, Donald Lee wrote:
>> I have been chasing lockups of NetBSD 6.0.1, and recently tried 6.0.2, and
>> have found that it locks up, too.  My problem is that this is intermittent,
>> so the first task is to find a failing test case.
>> 
>> I have a second machine set up that has hung up 3 times, twice with 6.0.2,
>> and once with 6.0.1.  The interesting difference is this in the log:
>> 
>> May 29 13:00:00 charm syslogd[151]: restart
>> May 29 21:52:13 charm /netbsd: arp info overwritten for 71.39.101.62 by 20:76:00:10:7f:14
>> May 30 14:44:08 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 75, complete 82
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 14:44:12 charm /netbsd: gem0: resetting anyway
>> May 30 15:01:45 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 20, complete 30
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 15:01:49 charm /netbsd: gem0: resetting anyway
>> May 30 18:15:30 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 58, complete 70
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>> May 30 18:15:34 charm /netbsd: gem0: resetting anyway
>> May 31 20:51:35 charm syslogd[151]: restart
>> 
>> 
>> I take this as a clue, and I am going to put in a PCI ethernet card, (SMC)
>> and see if that behaves differently.
>> 
>> Note that the "watchdog" message with the reset is new in 6.0.2,
>> so I'm guessing that someone changed the gem driver - just a guess....
>> 
>> I'll report back.
>> 
>> It takes a day or two or three for the failure to occur.  I originally
>> thought it was a failure that happened under heavy disk load, but it
>> turns out that at least with the last couple of failures, it happens
>> on an almost idle machine.  The only "load" I have on it is a script that
>> does two wget's in a loop.  One wget is of a small index file, and the
>> other is of a 1 Meg file.  It does the wgets as fast as it can.  It seems
>> to cause the problem in a couple of days.
>> 
>> I have now swapped in the SMC ethernet card.  Let's see if it still fails.
>> If not, then I have a workaround, and we have a possible driver bug to
>> fix.
>> 
>> -dgl-

