Subject: Re: nfsd: locking botch in op %d
To: None <tech-kern@netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-kern
Date: 03/10/2001 13:34:32
>>>> The NFS server on my house LAN's NFS subnet fell over with "nfsd:
>>>> locking botch in op 3".
>>> It also crashed when doing a lookup for a device node ("sd0a" in
>>> your case), curiously enough, so there may be a problem there.
I believe it does.
I threw in debugging code. Basically, I added a private struct proc *
between nfs_syscalls.c and kern_lock.c; when a LOOKUP request is being
processed, all lock operations affecting the relevant process get a log
of debugging output. I also added __FILE__ and __LINE__ stuff,
somewhat akin to what I find in -current's kern_lock.c, but not
actually stored in the lock, just printed.
It slows things down, but it's given me some very useful info.
In particular, I can match up locks and unlocks. Here's the relevant
trace, which I've postprocessed by hand for readability and to bring
out particularly interesting aspects. The leading letter is a short
version of the lock pointer which is printed in hex later in the line.
A genfs_vnops.c, 317: locks 0 EXCLUSIVE on 0xf1af51c4 -> 0 locks 1
A genfs_vnops.c, 334: locks 1 RELEASE on 0xf1af51c4 -> 0 locks 0
A genfs_vnops.c, 317: locks 0 EXCLUSIVE on 0xf1af51c4 -> 0 locks 1
B ffs_vfsops.c, 1009: locks 1 EXCLUSIVE on 0xf026ba58 -> 0 locks 2
C vfs_subr.c, 246: locks 2 SHARED on 0xf039aa18 -> 0 locks 3
C vfs_subr.c, 259: locks 3 RELEASE on 0xf039aa18 -> 0 locks 2
D ufs_ihash.c, 131: locks 2 EXCLUSIVE on 0xf1b0c4e4 -> 0 locks 3
B ffs_vfsops.c, 1043: locks 3 RELEASE on 0xf026ba58 -> 0 locks 2
E ufs_vnops.c, 1923: locks 2 EXCLUSIVE on 0xf1a544ac -> 0 locks 3
A genfs_vnops.c, 334: locks 3 RELEASE on 0xf1af51c4 -> 0 locks 2
A genfs_vnops.c, 317: locks 2 EXCLUSIVE on 0xf1af51c4 -> 0 locks 3
A genfs_vnops.c, 334: locks 3 RELEASE on 0xf1af51c4 -> 0 locks 2
E genfs_vnops.c, 334: locks 2 RELEASE on 0xf1a544ac -> 0 locks 1
At this point the problem is present. Looking at locks and unlocks,
they all match up except for D, which never gets unlocked.
Looking at the routines those file/line pairs indicate, it appears to
me that what's happening is more or less like this.
A LOOKUP happens on a device node. ffs_vget is called, misses in the
inode cache, and takes ufs_hashlock (the first B line, which is the
do-while near the beginning of ffs_vget). Everything proceeds
normally, and it calls ufs_hashins() some 40 lines farther into
ffs_vget; the unmatched D lock is taken in ufs_hashins(), and is on the
affected inode's vnode. ffs_vget then releases ufs_hashlock (the
second B line). Then another 35-40 lines farther down, ufs_vinit is
called. It promptly calls checkalias(), which finds the vnode aliased
(the server apparently has a live vnode for its own sd0a, not
surprising since its own root is on sd0a, and the client's sd0a is the
same <major,minor> as the server's). It appears that the lock
ufs_hashins() took on the vnode gets lost somewhere in the checkalias()
shuffle.
>>> Are you using softdeps on the server, btw?
>> No. I've never even tried to use softdeps, anywhere.
> I am using softdeps, which was where Frank thought the problem might
> be.
Well, if so, it's in a part of softdeps that's used even if softdeps
aren't enabled on the filesystems in question. :-)
But based on what I've outlined above I really think it's more likely a
problem with checkalias() not correctly merging locks when collapsing
aliased vnodes.
der Mouse
mouse@rodents.montreal.qc.ca
7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B