Port-macppc archive
Re: lockups on 6.0.2 - progress?
Following up.....
This test - running the wget script to localhost instead of to ethernet -
does not produce the same failure. The failure appears to be some sort of
problem with the keyboard/display/terminal machinery, not the kernel.
When I leave the ethernet up, and run the script, and send the output to
a file (just like the failing case below), it runs for days. If I
ssh into the machine, I can see it's still running.
I have to generate the load over the network to cause the failure,
it appears, so the network/network driver is likely part of the problem.
-dgl-
At 4:20 PM -0500 6/6/13, Donald Lee wrote:
>My environment:
>
>PowerMac G4, 896 MBytes mem. 2 ATA disks on the internal bus/ribbon.
>NetBSD 6.0.2. Standard kernel.
>Machine name: charm.icompute.com charm
>
>I changed my test case so that the wget script runs on charm. I disabled
>the network interfaces and set up apache to listen on localhost.
>
>The script looks like this:
>
>---
>#!/bin/ksh
>
>set -e
>
>while true ; do
> date
> wget -t 1 -T 8 -q -a logfile -O index.html http://127.0.0.1
> echo -n index
> wget -t 1 -T 8 -q -a logfile -O text.txt http://127.0.0.1/text.txt
> echo text
>done
>---
>
>When I run it, in a few hours the machine hangs. It's not the hard hang
>I get when I run the script on another machine, but it is a hang.
>Without a network, I can't ping it, or ssh/telnet to it and run multiple
>windows. All I know is that ctrl-c does not produce a prompt from the shell,
>and the script does not fail (timeout), but stops producing output.
>
>Unlike when the script runs on another machine, the keyboard does
>echo chars to the screen, but that's it. (I think only return chars are
>echoed.... have to check next time it fails.)
>
>I've tried leaving different things running while the test is active,
>and writing output to a file. tail -f, top, systat - all behave the same.
>When the hang comes, no response and no new shell prompt.
>
>Bottom line
>-=--=-=-=-=-=
>
>I've eliminated the network cards. *IF* this is the same problem, it
>looks like it's not in the drivers.
>
>-dgl-
>
>>Hi,
>>
>>I reported problems with gem(4) on macppc as a bug (kern/46083).
>>As the system board is now broken, I can no longer test myself
>>(or confirm that it's something related to the driver or an
>>already broken board; at least it was running with NetBSD 5.x
>>and Linux without problems while on -6 I had an unstable gem(4)).
>>
>>Maybe both problems are related, even though I didn't see much
>>output. Running makemandb over nfs was enough to break gem(4)
>>connection. If you think it may be related, it might help
>>to combine both bug-reports in a single PR (or if you haven't
>>added any PR so far, add your experiences to the kern/46083).
>>
>>--
>>Regards
>>Matthias Kretschmer
>>
>>
>>On Fri, May 31, 2013 at 09:03:42PM -0500, Donald Lee wrote:
>>> I have been chasing lockups of NetBSD 6.0.1, and recently tried 6.0.2, and
>>> have found that it locks up, too. My problem is that this is intermittent,
>>> so the first task is to find a failing test case.
>>>
>>> I have a second machine set up that has hung up 3 times, twice with 6.0.2
>>> and once with 6.0.1. The interesting difference is this in the log:
>>>
>>> May 29 13:00:00 charm syslogd[151]: restart
>>> May 29 21:52:13 charm /netbsd: arp info overwritten for 71.39.101.62 by 20:76:00:10:7f:14
>>> May 30 14:44:08 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 75, complete 82
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 14:44:12 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 14:44:12 charm /netbsd: gem0: resetting anyway
>>> May 30 15:01:45 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 20, complete 30
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 15:01:49 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 15:01:49 charm /netbsd: gem0: resetting anyway
>>> May 30 18:15:30 charm /netbsd: gem0: receive error: RX overflow sc->rxptr 58, complete 70
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: not in overflow state: 0x810400
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: wr pointer != saved
>>> May 30 18:15:34 charm /netbsd: gem0: rx_watchdog: rd pointer != saved
>>> May 30 18:15:34 charm /netbsd: gem0: resetting anyway
>>> May 31 20:51:35 charm syslogd[151]: restart
>>>
>>>
>>> I take this as a clue, and I am going to put in a PCI ethernet card, (SMC)
>>> and see if that behaves differently.
>>>
>>> Note that this message, the "watchdog" thing with the reset, is new in 6.0.2,
>>> so I'm guessing that someone changed the gem driver - just a guess....
>>>
>>> I'll report back.
>>>
>>> It takes a day or two or three for the failure to occur. I originally
>>> thought it was a failure that happened under heavy disk load, but it
>>> turns out that at least with the last couple of failures, it happens
>>> on an almost idle machine. The only "load" I have on it is a script that
>>> does two wget's in a loop. One wget is of a small index file, and the
>>> other is of a 1 Meg file. It does the wgets as fast as it can. It seems
>>> to cause the problem in a couple of days.
>>>
>>> I have now swapped in the SMC ethernet card. Let's see if it still fails.
>>> If not, then I have a workaround, and we have a possible driver bug to
>>> fix.
>>>
>>> -dgl-
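
Since the failure takes days to show up, it may help to tally the gem0
overflow messages per day and see whether they cluster before a hang.
The following is only a rough sketch, not something posted in this thread:
the function name count_rx_overflows is made up for illustration, and it
assumes the kernel messages are stored one per line in the same format as
the syslog excerpt quoted above.

```shell
#!/bin/sh
# Sketch: count "gem0: receive error: RX overflow" lines per day.
# Assumption: $1 is a syslog-style file whose first two fields are the
# month and the day, as in the excerpt quoted in this thread.
count_rx_overflows() {
    awk '/gem0: receive error: RX overflow/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' "$1" | sort
}
```

Something like `count_rx_overflows /var/log/messages` would be the usual
way to call it, assuming that is where syslogd writes kernel messages on
the machine in question.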