tech-kern: Re: Nfs clients get frozen when NFS server crashes...

Subject: Re: Nfs clients get frozen when NFS server crashes...
To: None <netbsd-users@netbsd.org, tech-kern@netbsd.org>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/04/2000 01:27:57
[ On Wednesday, October 4, 2000 at 14:57:47 (+1100), Robert Elz wrote: ]
> Subject: Re: Nfs clients get frozen when NFS server crashes... 
>
> On the off chance it wasn't clear, reboots of the server are generally
> no great problem.   It is (semi-)prolonged outages (hour or more) that
> cause most problems (though I suspect it depends a lot on how much NFS
> activity is scheduled while the server is down).

Indeed I wasn't quite sure what exactly was happening, though that may
not have been due to any lack of clarity in your post....

I do tend to keep clients from making unnecessary requests to the
server while it is down, and indeed it's usually just my own jobs on
various machines that are affected, since most of my casual users log
in directly to the server itself.  I *think* I do this because I know
that if I keep making the client attempt accesses, the load on the
server will be that much harder on it after it recovers.  I don't
always notice (or correctly diagnose) the problem right away, though,
and as a result I often queue up more operations than I would have if
I'd noticed the problem immediately.
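
A minimal sketch of that kind of restraint, assuming a job runner that
can probe the server before touching the mount.  The names
server_reachable and run_if_server_up are hypothetical, and the probe
assumes the server accepts TCP connections on port 2049; a UDP-only
server would need a different check, and a successful connect does not
prove that nfsd is actually healthy.

```python
import socket

NFS_PORT = 2049  # standard NFS port; assumes the server accepts TCP here


def server_reachable(host, port=NFS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or unresolvable: treat as down.
        return False


def run_if_server_up(host, job, probe=server_reachable):
    """Run job() only when probe(host) reports the server as answering."""
    if not probe(host):
        # Hold the job back rather than queue more NFS requests
        # against a server that is not responding.
        return None
    return job()
```

The probe is passed in as a parameter so that the decision logic can be
exercised without a real server.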

I've also sometimes taken the opportunity to reboot at least my diskless
client when the server's been down for any length of time (since usually
any downtime longer than it takes to initiate a reboot is for
semi-planned purposes).

However, as I say, I don't think I've ever encountered a situation where
I've had to reboot an NFS client after the server's been down for even
extensive periods of time (e.g. over a long weekend).

That's not to say that I haven't done so under duress, though.  I did
have a *lot* of trouble trying to do TCP mounts from a BSDI-1.1 system
and often had to reboot it.  However, those were usually "emergencies"
because it was the gateway at the time and I never had time to wait for
recovery timeouts -- server crashes always happen at the worst possible
times!  As I recall, though, those problems went away when I switched
back to using UDP mounts.

I've never even tried to simulate dozens or hundreds of unrestrained
clients trying to pound on a dead server, and the only sizable
production networks I've dealt with have not been running NetBSD on the
server side (nor even on any more than a very few clients).  I don't
know quite how it could happen, but perhaps the recovered server fails
to handle some of the barrage of transactions (or maybe fails to handle
them in the normal order) and maybe then the client can get stuck in an
unrecoverable state.

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>