Subject: Re: nfsd: locking botch in op %d
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 03/10/2001 13:34:32
>>>> The NFS server on my house LAN's NFS subnet fell over with "nfsd:
>>>> locking botch in op 3".
>>> It also crashed when doing a lookup for a device node ("sd0a" in
>>> your case), curiously enough, so there may be a problem there.
I believe it does.
I threw in debugging code. Basically, I added a private struct proc *
between nfs_syscalls.c and kern_lock.c; when a LOOKUP request is being
processed, all lock operations affecting the relevant process get a log
of debugging output. I also added __FILE__ and __LINE__ stuff,
somewhat akin to what I find in -current's kern_lock.c, but not
actually stored in the lock, just printed.
It slows things down, but it's given me some very useful info.
In particular, I can match up locks and unlocks. Here's the relevant
trace, which I've postprocessed by hand for readability and to bring
out particularly interesting aspects. The leading letter is a short
version of the lock pointer which is printed in hex later in the line.
A genfs_vnops.c, 317: locks 0 EXCLUSIVE on 0xf1af51c4 -> 0 locks 1
A genfs_vnops.c, 334: locks 1 RELEASE on 0xf1af51c4 -> 0 locks 0
A genfs_vnops.c, 317: locks 0 EXCLUSIVE on 0xf1af51c4 -> 0 locks 1
B ffs_vfsops.c, 1009: locks 1 EXCLUSIVE on 0xf026ba58 -> 0 locks 2
C vfs_subr.c, 246: locks 2 SHARED on 0xf039aa18 -> 0 locks 3
C vfs_subr.c, 259: locks 3 RELEASE on 0xf039aa18 -> 0 locks 2
D ufs_ihash.c, 131: locks 2 EXCLUSIVE on 0xf1b0c4e4 -> 0 locks 3
B ffs_vfsops.c, 1043: locks 3 RELEASE on 0xf026ba58 -> 0 locks 2
E ufs_vnops.c, 1923: locks 2 EXCLUSIVE on 0xf1a544ac -> 0 locks 3
A genfs_vnops.c, 334: locks 3 RELEASE on 0xf1af51c4 -> 0 locks 2
A genfs_vnops.c, 317: locks 2 EXCLUSIVE on 0xf1af51c4 -> 0 locks 3
A genfs_vnops.c, 334: locks 3 RELEASE on 0xf1af51c4 -> 0 locks 2
E genfs_vnops.c, 334: locks 2 RELEASE on 0xf1a544ac -> 0 locks 1
At this point the problem is present. Looking at locks and unlocks,
they all match up except for D, which never gets unlocked.
Looking at the routines those file/line pairs indicate, it appears to
me that what's happening is more or less like this.
A LOOKUP happens on a device node. ffs_vget is called, misses in the
inode cache, and takes ufs_hashlock (the first B line, which is the
do-while near the beginning of ffs_vget). Everything proceeds
normally, and it calls ufs_hashins() some 40 lines farther into
ffs_vget; the unmatched D lock is taken in ufs_hashins(), and is on the
affected inode's vnode. ffs_vget then releases ufs_hashlock (the
second B line). Then another 35-40 lines farther down, ufs_vinit is
called. It promptly calls checkalias(), which finds the vnode aliased
(the server apparently has a live vnode for its own sd0a, not
surprising since its own root is on sd0a, and the client's sd0a is the
same <major,minor> as the server's). It appears that the lock
ufs_hashins() took on the vnode gets lost somewhere in the checkalias()
shuffle.
>>> Are you using softdeps on the server, btw?
>> No. I've never even tried to use softdeps, anywhere.
> I am using softdeps, which was where Frank thought the problem might
> be.
Well, if so, it's in a part of softdeps that's used even if softdeps
aren't enabled on the filesystems in question. :-)
But based on what I've outlined above I really think it's more likely a
problem with checkalias() not correctly merging locks when collapsing
aliased vnodes.
der Mouse
mouse@rodents.montreal.qc.ca
7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B