Subject: kern/32318: NFS client or server hang
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 12/16/2005 19:00:01
>Number: 32318
>Category: kern
>Synopsis: NFS client or server hang
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Dec 16 19:00:01 +0000 2005
>Originator: Manuel Bouyer
>Release: NetBSD 3.0_RC3
>Organization:
>Environment:
System: NetBSD chassiron.antioche.eu.org 3.0_RC3 NetBSD 3.0_RC3 (CHASSIRON) #0: Sat Nov 26 15:11:16 CET 2005 bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/sparc/obj/local/pop1/bouyer/netbsd-3/src/sys/arch/sparc/compile/CHASSIRON sparc
Architecture: sparc
Machine: sparc
>Description:
Setup: I get mail from various POP3 servers via fetchmail and
deliver it to local folders (mbox format) via procmail; the folders are
stored on an NFS server.
fetchmail/procmail run on an x86 box (Celeron 500) running a months-old
current:
NetBSD rochebonne.antioche.eu.org 3.99.7 NetBSD 3.99.7 (ROCHEBONNE) #1: Tue Aug 9 23:54:57 CEST 2005 bouyer@pop.lip6.fr:/local/pop1/bouyer/tmp/i386/obj/local/pop1/bouyer/current/src/sys/arch/i386/compile/ROCHEBONNE i386
The NFS server is a sparc IPX (40 MHz sparcv7).
Problem: from time to time, the processes accessing the files on
the NFS server hang. This usually happens when the client makes
two concurrent accesses to the mailboxes (e.g. reading a mailbox
with mutt while procmail tries to deliver a mail to this mailbox).
I've seen this also before the 3.0 branch was cut, with the NFS server
running 2.0 or 2.1. I've never noticed this when the server was running
1.6.2 (it started happening when the server got upgraded).
Doing a /etc/rc.d/nfsd restart on the server unwedges the processes
on the client box.
Today I managed to reproduce this with a tcpdump running.
The full trace is at:
ftp://chassiron.antioche.eu.org/pub/private/nfs.hang.gz
(the hang begins at 19:19:35, I ran the /etc/rc.d/nfsd restart at
19:23:03).
When the processes are stuck, the only traffic between
the client and server is:
19:19:35.106216 IP rochebonne.antioche.eu.org.82 > chassiron.localhost.nfs: 40 null
19:19:35.108362 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.82: reply ok 24 null
Before that the server sent a stream of
19:19:24.927421 IP chassiron.localhost.nfs > rochebonne.antioche.eu.org.1098072401: reply ERR 1460
I'm not sure whether this is normal (is it an error, or a normal
reply to a read?)
It also looks like the client opened a second TCP connection at
19:19:26.792149, maybe for the concurrent accesses?
To me it looks like this request:
19:19:26.845210 IP rochebonne.antioche.eu.org.809670347 > chassiron.localhost.nfs: 148 lookup fh 25,15/13347 "_bX.uUwoDB.rochebonne.antioch"
got no reply, and this is what caused the hang. After the nfsd restart,
the same request was sent twice; the second one got the reply
"no such file or directory".
Now I don't know if this is a client-side or server-side issue. The
server seems to lose requests, but is the client supposed to
retry with NFS over TCP?
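One way to look for other lost requests in the trace is to pair each
request with its reply by xid. The following is only a rough sketch, not
something that was run against the full trace. It assumes the trace has
been turned into text lines in the format quoted above, where tcpdump
prints the RPC xid in place of the port number ("client.XID > server.nfs"
for a request, "server.nfs > client.XID: reply ..." for a reply). The
function name unanswered_requests is made up for this example. With NFS
over TCP, tcpdump may not decode every RPC correctly, so the result
should be treated as a hint only.

```python
import re

# Request: "TIME IP client.XID > server.nfs: LEN OP ARGS"
REQUEST = re.compile(r"^(\S+) IP \S+\.(\d+) > \S+\.nfs: (.*)$")
# Reply:   "TIME IP server.nfs > client.XID: reply STATUS ..."
REPLY = re.compile(r"^(\S+) IP \S+\.nfs > \S+\.(\d+): reply (.*)$")


def unanswered_requests(lines):
    """Return {xid: (timestamp, summary)} for requests never followed by a reply."""
    pending = {}
    for line in lines:
        line = line.strip()
        m = REPLY.match(line)
        if m:
            # A reply clears the pending request with the same xid, if any.
            pending.pop(m.group(2), None)
            continue
        m = REQUEST.match(line)
        if m:
            # A retransmission reuses the xid; keep the first timestamp seen.
            pending.setdefault(m.group(2), (m.group(1), m.group(3)))
    return pending


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python unanswered.py < trace.txt
    for xid, (when, what) in unanswered_requests(sys.stdin).items():
        print(xid, when, what)
```

Fed the null request/reply pair and the lookup quoted in this report, it
would list only the lookup as unanswered.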
>How-To-Repeat:
Try concurrent accesses to the same file or directory against
a slow NFS server?
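A possible way to generate such concurrent accesses without waiting for
real mail is sketched below. This is an untested guess at a reproducer,
not a known trigger for the hang: one process appends messages to an
mbox file under a link()-based dotlock (an assumption about roughly what
a delivery agent does), while a second process rereads the same file.
The function names, file names and the count of 200 are all made up for
this example; the path given on the command line should be on the NFS
mount.

```python
import os
import sys
import time
from multiprocessing import Process


def deliver(mbox, count):
    """Append `count` messages to mbox, taking mbox.lock around each write."""
    lock = mbox + ".lock"
    for i in range(count):
        # Create a unique temp file, then link() it to the lock name.
        tmp = "%s.%d.%d.tmp" % (mbox, os.getpid(), i)
        open(tmp, "w").close()
        while True:
            try:
                os.link(tmp, lock)  # fails if the lock file already exists
                break
            except FileExistsError:
                time.sleep(0.01)
        os.unlink(tmp)
        try:
            with open(mbox, "a") as f:
                f.write("From test %d\n\nbody %d\n\n" % (i, i))
        finally:
            os.unlink(lock)


def read_loop(mbox, count):
    """Read the whole mailbox `count` times, as a mail reader might."""
    for _ in range(count):
        try:
            with open(mbox) as f:
                f.read()
        except FileNotFoundError:
            pass  # nothing delivered yet


def run(mbox, count=200):
    procs = [Process(target=deliver, args=(mbox, count)),
             Process(target=read_loop, args=(mbox, count))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    run(sys.argv[1])
```

If the processes stop making progress, that would be the moment to
check the tcpdump output on the client.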
>Fix:
yes, please