Subject: Re: CVS commit: src/sys
To: Andrew Doran <ad@netbsd.org>
From: Bill Stouder-Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/30/2007 16:01:27
On Sun, Jul 29, 2007 at 02:17:22PM +0100, Andrew Doran wrote:
> On Sun, Jul 29, 2007 at 12:40:37PM +0000, Antti Kantee wrote:
> 
> > Modified Files:
> > src/sys/kern: kern_lock.c vfs_subr.c
> > src/sys/sys: lock.h
> > 
> > Log Message:
> > Define a new lockmgr flag LK_RESURRECT which can be used in
> > conjunction with LK_DRAIN. This has the same effect as LK_DRAIN
> > except it atomically does NOT mark the lock as drained. This
> > guarantees that when we got the lock, we were the last one currently
> > waiting for the lock.
> > 
> > Use LK_DRAIN|LK_RESURRECT in vclean() to make sure there are no
> > waiters for the lock. This should fix behaviour theorized to be
> > caused by vfs_subr.c 1.289 which caused vclean() to run to
> > completion and free the vnode before all lock-waiters had been
> > processed. Should therefore fix the "simple_lock: unitialized lock"
> > problems seen recently.
> 
> I think this is a good stopgap for fixing NetBSD 4, but also that it's the
> wrong way to solve the problem. The issue is that there are two threads
> trying to deactivate the vnode, and eventually one goes on to reclaim. That
> can be prevented by continuing to hold a reference to the vnode across the
> call to deactivate it. Once it's deactivated (and we've checked it doesn't
> need to be re-deactivated due to a ref in the meantime) we should go on to
> lock it and reclaim it. The premise being that no-one should try to lock a
> vnode without first holding a reference to it.
Do you have references describing this issue? I'm not sure where it'd
occur.
The layering issue is that we export the struct lock to upper layer file
systems. This is in part an efficiency concern and in part an implementation
(a simplistic implementation) of Heidemann's caching ideas (see his
dissertation, which I think I have as a .ps file if you need it).
In the layer case, the problem is that there are still references to the
lock after we clean the vnode, which is why we want to reactivate it.
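
A rough sketch of what the flag combination buys us here - illustrative
only, not the committed vfs_subr.c change:

    /*
     * Sketch only.  LK_DRAIN waits until every other waiter on the
     * lock has been processed; adding LK_RESURRECT skips the final
     * "mark this lock drained" step, so stragglers coming in through
     * a layered mount can still take and release the lock safely.
     */
    if (lockmgr(vp->v_vnlock, LK_DRAIN | LK_RESURRECT | LK_INTERLOCK,
        &vp->v_interlock) != 0)
            panic("vclean: cannot drain lock");
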
> On the other hand, maybe it's time we looked at pushing the locking back
> into the file systems and being more strict about the reference counting, so
> that these kinds of issues don't continue to appear. :-). What do people
> think of the idea? Some of the issues involved in doing that I know of are..
I used to dislike the idea, but I now like it. I think it's the right way
to go in the long run. Well, I think we may still need
VOP_LOCK()/VOP_UNLOCK(), but they'd be much more like the SVR4 versions -
you only use them when you need to make an atomic sequence of a number of
VOPs. Most uses just do all locking internally.
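
Something like this, with simplified and partly hypothetical signatures:

    /*
     * Hypothetical SVR4-style usage: the external lock exists only
     * to make a multi-VOP sequence atomic.  Each VOP still does its
     * own internal locking for its own consistency.
     */
    VOP_LOCK(vp);
    error = VOP_GETATTR(vp, &va, cred);         /* check the size... */
    if (error == 0 && va.va_size != 0) {
            VATTR_NULL(&va);
            va.va_size = 0;
            error = VOP_SETATTR(vp, &va, cred); /* ...and truncate,
                                                   atomically w.r.t.
                                                   other writers */
    }
    VOP_UNLOCK(vp);
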
> o VOP_LOCK and VOP_UNLOCK would need to go away. To make VOP calls you'd
> only need a reference to the vnode.
See above. I mostly agree.
> o We'd need a VOP_RWLOCK or VOP_RANGELOCK for concurrent reads/writes.
Nah, just have the FS do this internally when an operation comes in.
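
That is, something like this inside the file system, where
foofs_rangelock() and foofs_dowrite() are made-up internal helpers:

    int
    foofs_write(void *v)
    {
            struct vop_write_args *ap = v;
            struct foofs_node *fp = ap->a_vp->v_data;
            int error;

            /* per-file range lock, invisible to callers */
            foofs_rangelock(fp, ap->a_uio->uio_offset,
                ap->a_uio->uio_resid);
            error = foofs_dowrite(fp, ap->a_uio);
            foofs_rangeunlock(fp);
            return error;
    }
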
> o The calls to VOP_LEASE look suspicious, since they are made and then
> followed by other VOP calls under the same lock. VOP_LEASE doesn't appear
> to do anything these days though.
I think we've dropped NQNFS support, so it doesn't. We should check that
NFSv4 implementations don't need it, though.
> o There are some places where VOP_ACCESS is followed by another call into
> the file system. Those checks would need to go into the file system.
Not necessarily.
I think the real test is to compare the # of call points that have
VOP_ACCESS() now with the number of internal calls we'd have if we pushed
it in: # callers now vs. # file system types * # VOPs that need these
checks internally. Note I'm assuming the FSs don't implicitly still do
VOP_ACCESS() checks anyway now.
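
(Illustration with made-up numbers: ~30 VOP_ACCESS() call sites today
versus, say, 15 file system types each checking in 5 VOPs = 75 internal
call points. If the real numbers look like that, pushing it in is a net
loss.)
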
> o Lookup currently holds exclusive vnode locks. It causes a lot of single
> threading and I think we could get a good speedup by avoiding that. My
> understanding is that's due to the inode/vnode being used to maintain
> state for whoever's doing the lookup. What other issues are there? Without
> understanding the code too clearly I imagine that has effects on layered
> file systems too.
Agreed.
There are two parts to lookup: getting to the last component and looking
up the last component. I don't think it'd be that bad to use shared locks
on the former, and we'd get a lot of parallelism.
I think SVR4 pushes the last component lookup into the VOP, so if we
changed VOPs we'd remove the need for external exclusive locks in the
latter part.
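
Roughly like this - a sketch only, the real namei loop is hairier, and
last_component() is a made-up predicate:

    /* shared locks while walking intermediate components */
    while (!last_component(cnp)) {
            vn_lock(dvp, LK_SHARED | LK_RETRY);
            error = VOP_LOOKUP(dvp, &vp, cnp);
            VOP_UNLOCK(dvp, 0);
            if (error)
                    return error;
            dvp = vp;
    }
    /*
     * No external exclusive lock for the final component; the file
     * system serializes its own lookup against create/remove.
     */
    error = VOP_LOOKUP(dvp, &vp, cnp);
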
Create makes good use of the exclusive lock. While looking to see if the
file exists, we find where we want to put it if we have to add it.
Rename would make good use of it if we did it differently. The idea is to
remember where the entry was in the directory when we do the initial
lookup to see if the to-be-moved file exists. The problem is that we drop
locks while looking up the destination directory.
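
I.e. stash something like this at initial lookup time (a made-up
structure, just to show the idea):

    /*
     * Made-up: remember the source entry's slot from the initial
     * lookup so rename doesn't have to re-scan the directory.  Only
     * safe if the directory lock is held throughout - which is
     * exactly what the current code doesn't do.
     */
    struct saved_dirpos {
            doff_t  sd_offset;      /* byte offset of the entry */
            int     sd_reclen;      /* record length at that slot */
    };
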
> o To eliminate v_interlock's part in locking a vnode, it would be difficult
> to be sure that we have completely reclaimed vnodes from a file system -
> because we can't be absolutely sure at which point calls stop being made
> into foo_vnops and start being made into dead_vnops. That would mean:
> after a forced unmount the file system code would need to stick around
> (not be unloaded) while there are still vnodes that it spawned. I guess we
> could add a reference count to the vnode to count the number of LWPs
> currently within a VOP and poll until the figure drops to zero for all the
> remaining vnodes. That's a lot of additional atomic ops for the system to
> do, though.
I don't see how v_interlock matters here. I agree though that we need to
do something to deal with killing concurrent VOPs when wiping out a vnode.
Perhaps the best way to handle this is to make sys_revoke() processing
different from revoke processing. Have sys_revoke() processing swap the
ops vector for one that marks all operations as errors except for
VOP_REVOKE, and leave revoke processing the same as for a normal node.
That way any existing references start reporting errors, and we should
drain off any access ASAP. Any concurrent access, though, remains &
completes. This would work well for the sys_revoke() issue that started
this discussion, and I'm not sure how well it'd work for what you
describe. Draining on some count may be what we'd have to do...
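
Sketch of the vector swap, where revoked_vnodeop_p is hypothetical -
think deadfs, but with VOP_REVOKE passed through:

    /* in sys_revoke() processing */
    simple_lock(&vp->v_interlock);
    vp->v_op = revoked_vnodeop_p;   /* hypothetical error vector */
    simple_unlock(&vp->v_interlock);
    /*
     * VOPs already in flight complete against the old vector; every
     * new call dispatches through the new one and returns an error,
     * except VOP_REVOKE itself.
     */
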
Take care,
Bill