Subject: Re: Strange network hang on Poweredge 860
To: Lars Friend <lfriend@mcci.com>
From: Chuck Swiger <cswiger@mac.com>
List: netbsd-help
Date: 09/10/2007 12:52:41
Hi, Lars--
On Sep 10, 2007, at 11:34 AM, Lars Friend wrote:
> Hello all,
> I've been experiencing a very strange mode of failure which has me
> scratching my head so I figured I'd ask here to see if anybody had
> seen something like this before.
>
> I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860 system
> (dual core P4 Xeon, 4GB ram, 2 SATA drives in software RAID using
> raidframe raid1).
[ ... ]
> So, we replaced the old system with our fancy new one, and four hours
> into operation, things get weird. The system is still running,
> everything seems okay, nothing unexpected or unpleasant in syslog,
> but the NIC is kaput. It sees link, seems to be okay, but it won't
> accept or make connections, pings, or any other network traffic.
[ ... ]
> Has anybody seen this before, or does anybody have a good hunch about
> what I can do to duplicate the failure? Once I can duplicate it "in
> captivity" it will be easier to debug, and easier to correct, but I
> would love to be able to duplicate it without putting it up live and
> letting it crash because that is not only a lot of work, but it
> inconveniences users who need to use the system.
There were a number of problems with the Broadcom NICs in Dell
machines reported on the FreeBSD lists, particularly in conjunction
with heavy UDP traffic such as NFS using the default transport. It
seems like the NIC would get confused about the state of the transmit
and receive buffers (some kind of refcounting problem?), and stop
passing traffic entirely, which sounds similar to the problem you've
reported.
There were also some initialization issues which tended to occur if
the NIC needed to be reset/woken up after entering an ACPI sleep
state, doing WOL, or similar. One of Broadcom's engineers, David
Christensen <davidch@broadcom.com>, has done work to fix them and to
improve the diagnostic messages so that better information is
reported when the adaptor gets confused.
You might find the threads here:
http://lists.freebsd.org/pipermail/freebsd-net/2007-June/thread.html
...such as "Problems with BCE network adapter (Dell PE2950)" to
contain some helpful info and code patches. It seems like the
OpenBSD folks have also implemented some fixes and workarounds for
PHY bugs in the BCM 575x/578x chipsets, going by:
http://leaf.dragonflybsd.org/mailarchive/commits/2007-05/msg00036.html
Perhaps someone more familiar with the status of the BCM driver in
NetBSD could offer more detailed information than I can, but at least
you've got a starting point and the name of a Broadcom engineer who
has worked on their BSD drivers.
Regards,
--
-Chuck
PS: I wouldn't swap in a RealTek NIC given a choice-- the newer NICs
from them aren't bad, but the older ones seemed to be flaky as well;
instead I'd try an Intel Fast EtherExpress Pro ("fxp" to me, I think
NetBSD calls 'em "wm", though), or the DEC "tulip" 21x4x chips ("dc"
or "de" probably?)....