tech-kern: locking bug in coda

Subject: locking bug in coda_lookup?
To: None <tech-kern@netbsd.org>
From: Greg Troxel <gdt@ir.bbn.com>
List: tech-kern
Date: 08/31/2004 09:48:33

I got a panic in coda_lookup:

  unlocked parent but couldn't lock child

while doing

  find . -type f -print0 | xargs -0 cat > /dev/null

in coda.  I'm not certain which of two very similar fragments the code
was in (don't have netbsd.gdb any more), but it was very much like
this:

	    if (*ap->a_vpp) {
		if ((error = vn_lock(*ap->a_vpp, LK_EXCLUSIVE))) {
		    printf("coda_lookup: ");
		    panic("unlocked parent but couldn't lock child");
		}
	    }

It seems that this will fail if someone else has the vnode locked, and
there will be no retry.  From reading vn_lock, it seems that if one
passes LK_NOWAIT, vn_lock returns immediately on encountering a locked
vnode.  If neither LK_NOWAIT nor LK_RETRY is set, it seems that vn_lock
will tsleep on the vnode's v_interlock, and then return ENOENT instead
of retrying.  The ufs code uses LK_RETRY in what I think is the
analogous case.
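
To make that reading concrete, here is a minimal userland sketch of the
control flow described above.  It is a toy model, not the NetBSD source:
struct toy_vnode, toy_vn_lock, and the TOY_LK_* flag values are all
hypothetical, the sleep is simulated by a counter, and returning EBUSY
in the no-wait case is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for this model only; not the kernel's LK_* values. */
#define TOY_LK_NOWAIT 0x1
#define TOY_LK_RETRY  0x2

/* Toy stand-in for a vnode that some other thread currently has locked. */
struct toy_vnode {
	int busy_wakeups;	/* wakeups left before the other holder lets go */
	int locked_by_us;	/* set to 1 once this caller holds the lock */
	int sleeps;		/* how many times this caller had to sleep */
};

/*
 * Model of the described behavior:
 *  - lock free: take it and return 0;
 *  - lock held, TOY_LK_NOWAIT: return at once (EBUSY is assumed here);
 *  - lock held, no flags: sleep once, then return ENOENT without retrying;
 *  - lock held, TOY_LK_RETRY: sleep, then go around and try again.
 */
static int
toy_vn_lock(struct toy_vnode *vp, int flags)
{
	int error;

	assert(vp->busy_wakeups >= 0);
	do {
		if (vp->busy_wakeups == 0) {
			vp->locked_by_us = 1;
			return 0;
		}
		if (flags & TOY_LK_NOWAIT)
			return EBUSY;
		/* Simulated tsleep: one wakeup passes while we wait. */
		vp->sleeps++;
		vp->busy_wakeups--;
		error = ENOENT;
	} while (flags & TOY_LK_RETRY);
	return error;
}
```

In this model, a vnode with busy_wakeups = 1 locked with no flags sleeps
once and returns ENOENT even though the lock is free by the time the
sleep ends; the same call with TOY_LK_RETRY goes around the loop and
returns 0 with the lock held.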

I don't understand why it makes sense to sleep and not retry the lock
- if one isn't going to retry, what's the point of sleeping?

So, I think that coda_lookup should pass LK_RETRY.  But I either don't
quite or just barely understand vnode locking, so I'd appreciate
advice here.

Also, it seems that the coda lookup operation doesn't properly handle
IS_DOTDOT, where the locking rules are different.  That, rather than
the missing retry, could be what is behind my crash.