netbsd-bugs: kern/28971: NFS thread goes into readdir tight loop

Subject: kern/28971: NFS thread goes into readdir tight loop
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <eravin@panix.com>
List: netbsd-bugs
Date: 01/15/2005 02:27:00
>Number:         28971
>Category:       kern
>Synopsis:       NFS thread goes into readdir tight loop
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 15 02:27:00 +0000 2005
>Originator:     Ed Ravin
>Release:        NetBSD 2.0
>Organization:
PANIX
>Environment:
NetBSD  netbsd2.0.client 2.0 NetBSD 2.0 (PANIX-STD) #2: Mon Jan 10 17:12:07 EST 2005  root@netbsd2.0.client:/devel/netbsd/2.0/src/sys/arch/i386/compile/PANIX-STD i386
>Description:
We received an alarm that our NFS server was handling more than 4000 NFS operations per second, usually a sign of a runaway program on a client box.

The client turned out to be a NetBSD 2.0 system.  It was in a tight loop sending out the same READDIR request to the NFS server over and over again.  See trace below.  There was no process that seemed to account for the requests.  I used lsof, fstat, ps, and top, and could not find any process with a file open to the NFS server.  Four NFS threads were configured.  One of them showed high CPU time with ps, the others were idle.  I tried reducing the number of kernel NFS threads to zero using sysctl but the high-usage thread remained.  The only fix was to reboot the box.

Sample of trace follows.  More of trace is available upon request.

17:16:02.593074 IP (tos 0x0, ttl  64, id 25772, offset 0, flags [none], length: 156) netbsd2.0.client.691067607 > netapp-6.1.server.nfs: 128 readdir fh Unknown/81DA0E00CBCE1D0020000000000EDA81CBCE1D00B121000081DA0E00CBCE1D00 8192 bytes @ 0x1000 verf 0000000000000000
17:16:02.593196 IP (tos 0x0, ttl  64, id 15741, offset 0, flags [none], length: 160) netapp-6.1.server.nfs > netbsd2.0.client.691067607: reply ok 132 readdir POST: DIR 755 ids 0/0 sz 0x1000 nlink 13 rdev 0/0 fsid 0x21b1 fileid 0xeda81 a/m/ctime 1104963361.850010000 1101258830.589001000 1101258830.589001000 verf 0000000000000000
17:16:02.593303 IP (tos 0x0, ttl  64, id 25773, offset 0, flags [none], length: 156) netbsd2.0.client.691067608 > netapp-6.1.server.nfs: 128 readdir fh Unknown/81DA0E00CBCE1D0020000000000EDA81CBCE1D00B121000081DA0E00CBCE1D00 8192 bytes @ 0x1000 verf 0000000000000000
17:16:02.593421 IP (tos 0x0, ttl  64, id 15997, offset 0, flags [none], length: 160) netapp-6.1.server.nfs > netbsd2.0.client.691067608: reply ok 132 readdir POST: DIR 755 ids 0/0 sz 0x1000 nlink 13 rdev 0/0 fsid 0x21b1 fileid 0xeda81 a/m/ctime 1104963361.850011000 1101258830.589001000 1101258830.589001000 verf 0000000000000000
17:16:02.593526 IP (tos 0x0, ttl  64, id 25774, offset 0, flags [none], length: 156) netbsd2.0.client.691067609 > netapp-6.1.server.nfs: 128 readdir fh Unknown/81DA0E00CBCE1D0020000000000EDA81CBCE1D00B121000081DA0E00CBCE1D00 8192 bytes @ 0x1000 verf 0000000000000000

>How-To-Repeat:
We don't know how to repeat it.  It happened to us twice so far, on two separate Netapp / NetBSD 2.0 client boxes, on the same day.  No other occurences.
>Fix: