Subject: Re: NFS writes NetBSD vs FreeBSD
To: None <port-alpha@netbsd.org>
From: Stephen Jones <smj@cirr.com>
List: port-alpha
Date: 07/16/2004 10:59:16
Well, both machines hung hard late last night with no signs of why and
I unfortunately
do not have access to the halt switches on either to try to continue
from SRM into a debugger.
But, my tests ran for about an hour before the FreeBSD CS20 hung hard
first. While I was power cycling it to reboot, the NetBSD CS20 hung hard
(no doubt the loss of its NFS mount triggered some problems).
My goals are simple: reliability and user experience. I'm not really
looking at getting the fastest chunk of data across an NFS mount, as
long as speed is decent and other processes and operations running on
that mount aren't negatively affected. To simulate my users, I did the
following on both machines:
1. Allocated a 520mb md/mfs and had dd continuously write 512mb of zeros
to it
(both CS20s use the same type of memory)
2. ftp a 500 files of zeros from a local disk on the remote to a local
disk locally
repeatedly on the primary (fxp1)
3. wrote 200mb of zeros repeatedly to each other's NFS mounted
filesystem (via fxp0)
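
A minimal sketch of the repeated zero-write timing in steps 1 and 3,
written in Python rather than dd. The function name, the target path and
the sizes below are assumptions for illustration only; they are not the
commands that were actually run on the CS20s.

```python
import os
import tempfile
import time


def timed_zero_writes(path, size_mb, iterations):
    """Write size_mb megabytes of zeros to path, `iterations` times.

    Returns a list with the wall-clock seconds each pass took.
    """
    chunk = bytes(1024 * 1024)  # one megabyte of zero bytes
    timings = []
    for _ in range(iterations):
        start = time.monotonic()
        with open(path, "wb") as f:  # "wb" truncates, so each pass rewrites
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # include the flush to the backing store
        timings.append(time.monotonic() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical target; point this at the mfs or NFS mount under test.
    target = os.path.join(tempfile.gettempdir(), "zeros.bin")
    for n, secs in enumerate(timed_zero_writes(target, 8, 3), 1):
        print(f"pass {n}: {secs:.2f}s")
```

Recording the time of every pass, not just an average, is what makes a
slow creep in write time visible before a hang.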
What was interesting is that the FreeBSD load report was strange. I saw
0.19 at the height of it, yet the system felt very sluggish, and local
and remote operations had a significant delay measurable in seconds.
Sometimes the system would appear to be hung with no response even from
a ^T and then suddenly come back to life with no errors reported.
Writing 512mb to the memory disk initially took about 6.7 seconds but,
as you'd expect, slowly moved up to about 22.4 seconds over 257
iterations before hanging.
NetBSD started out at 4.3 seconds and slowly made its way up to 22.4
seconds over 268 iterations before hard hanging shortly after FreeBSD
hung.
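
To put the memory disk timings above side by side, here is a small
worked calculation in Python. It uses only the figures quoted in this
message (512mb per pass, start and end times, iteration counts); the
function name and the idea of an average per-iteration increase are
assumptions for illustration, and the growth was not necessarily linear.

```python
def summarize(start_s, end_s, iterations, size_mb=512):
    """Derive simple ratios from the first and last write timings."""
    return {
        # how many times slower the last pass was than the first
        "slowdown": end_s / start_s,
        # average growth per pass, assuming roughly steady growth
        "avg_increase_per_iter_s": (end_s - start_s) / iterations,
        # effective write rate at the start and at the end
        "start_mb_per_s": size_mb / start_s,
        "end_mb_per_s": size_mb / end_s,
    }


freebsd = summarize(6.7, 22.4, 257)
netbsd = summarize(4.3, 22.4, 268)

if __name__ == "__main__":
    for name, stats in (("FreeBSD", freebsd), ("NetBSD", netbsd)):
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

Both machines ended at the same 22.4 seconds per pass, so the larger
slowdown ratio for NetBSD only reflects its faster starting point.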
The transfer numbers again aren't super important to me, but rather how
the systems felt during that hour. I had a loop with a 20 second sleep
doing ls operations on each NFS mounted filesystem, and I found that the
NetBSD CS20 completed in fewer real seconds (by a few, sometimes more)
than the same operations running on the FreeBSD CS20. I also had top
running on both to monitor process states for vnlock or nfsrcvlk
deadlocks. I'm guessing I'd have to get something like this into
production to actually see if vnlock deadlocks even occur. As we had
suspected, once the fxp driver was straightened out this would improve.
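
A rough sketch of the kind of listing loop described above: list a
directory, record how long it took, sleep, repeat. This is Python
standing in for a shell loop around ls; the function name, the
directory and the round count are assumptions, and only the 20 second
sleep comes from the test itself.

```python
import os
import tempfile
import time


def time_listings(directory, rounds, pause_seconds=20.0):
    """List `directory` `rounds` times, sleeping between passes.

    Returns the wall-clock seconds each listing took.
    """
    timings = []
    for i in range(rounds):
        start = time.monotonic()
        os.listdir(directory)  # roughly what a plain ls has to do
        timings.append(time.monotonic() - start)
        if i < rounds - 1:
            time.sleep(pause_seconds)  # no sleep needed after the last pass
    return timings


if __name__ == "__main__":
    # Hypothetical directory; use the NFS mount point under test instead.
    for secs in time_listings(tempfile.gettempdir(), 2, pause_seconds=1.0):
        print(f"listing took {secs:.3f}s")
```

Note that an `ls -l` would also stat every entry, which costs extra NFS
round trips that this plain listing does not measure.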
Starting top on FreeBSD usually takes several seconds, but during the
high load and activity it took several minutes. On NetBSD top started
up within a few seconds, and ps, vmstat, netstat and pstat all responded
faster than on the FreeBSD CS20.
After both hung hard, I rebooted them and just started #3 up, which has
run continuously all night (at least for the past 11 hours).
No NFS timeout / not responding/alive messages have been reported by
either, but
we assumed that would go away once the fxp driver was sorted out.
There was no evidence of what caused the hard hangs and I think that is
the toughest problem. Michael Hitch pointed out that it is possible to
get to a debugger, but you need some sort of minion to hit the halt
switch for you. I've got a scrap CS20 here I'm going to try to wire up
an APC remote to (MP or not, I hate power cycling CS20s or any
computer).
I suppose what I could do is try running the same abusive test with
single CPU kernels and hopefully get the panic message.