Subject: Re: More vnd problems
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Chuck Silvers <chuq@chuq.com>
List: port-xen
Date: 04/30/2005 17:29:40
(cc'd to tech-kern also, since this isn't xen-specific)
On Sun, Apr 24, 2005 at 11:30:11AM -0700, Chuck Silvers wrote:
> On Thu, Apr 21, 2005 at 12:50:03PM +0200, Manuel Bouyer wrote:
> > On Wed, Apr 20, 2005 at 06:48:13PM -0700, Chuck Silvers wrote:
> > > this is probably PR 12189.
> > > the vnd code has serious design flaws.
> > >
> > > you can reduce the likelihood of hitting that problem
> > > by using the raw vnd device instead of the block vnd device
> > > as much as possible.
> >
> > Now that vnd I/O to the file is done through a kernel thread, can't this
> > be solved more easily?
>
> no, more threads don't help the problems described in PR 12189.
>
> the threads do solve another problem that hasn't been previously discussed,
> which is that since the number of mutually recursive calls between vnd
> and the file system drivers is not explicitly limited anywhere, we could
> easily overflow the kernel stack. now that each entry back into vnd
> gets to start over with a fresh stack, that problem no longer exists.
>
> I think the best way to solve both of the problems I described in the PR
> is to pre-load the file's entire bmap and cache it in kernel memory
> (ie. not in the buffer cache) while configuring a vnd. writing dirty
> buffer cache buffers back to the underlying storage will then not require
> reading other buffers, thus avoiding the hidden dependencies between
> buffers that are the heart of the problem.
>
> to make this work out, we would also need to make sure that we either
> prevent (ie. fail) changes to the file's bmap while we're caching it
> outside the buffer cache, or else update the cached copy too and
> synchronize with I/O in flight (see PR 26983). either of these will
> require changing every file system driver (well, every fs that creates
> dirty buffer cache buffers).
>
> -Chuck
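
to make the bmap-caching idea quoted above a bit more concrete, here's a
rough sketch of what the pre-loading step at vnd configure time might look
like ("vnd_load_bmap", the table layout and the "sc_bmap" softc member are
all made up for illustration, and the locking around VOP_BMAP is omitted):

    struct vnd_bmap_entry {
        daddr_t ve_blkno;   /* physical block number, or -1 for a hole */
        int     ve_run;     /* contiguous blocks following this one */
    };

    static int
    vnd_load_bmap(struct vnd_softc *vnd, struct vnode *vp, daddr_t nblks)
    {
        struct vnd_bmap_entry *map;
        struct vnode *devvp;
        daddr_t lbn;
        int error;

        map = malloc(nblks * sizeof(*map), M_DEVBUF, M_WAITOK);
        for (lbn = 0; lbn < nblks; lbn++) {
            /* translate each logical block of the backing file once */
            error = VOP_BMAP(vp, lbn, &devvp,
                &map[lbn].ve_blkno, &map[lbn].ve_run);
            if (error != 0) {
                free(map, M_DEVBUF);
                return error;
            }
        }
        vnd->sc_bmap = map;     /* hypothetical softc member */
        return 0;
    }
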
another way to handle this would be to give each vnd instance its own
buffer cache. that should order the dependencies such that writing one
buffer would never need to read another buffer in the same cache,
only in a lower-level cache (lower in the sense that it's the cache for
the fs that contains the backing-store file for the buffer being written).
we could store the info on which cache to use for a given device in its
specinfo structure, or use the existing global cache if a device doesn't
have a special cache.
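
a rough sketch of the lookup rule, where "struct bufcache" and "si_bufcache"
are invented names just to show where the per-device hook would hang and how
other devices would fall back to the existing global cache:

    struct bufcache;                        /* per-vnd cache, invented type */
    extern struct bufcache global_bufcache; /* stands in for today's single cache */

    static struct bufcache *
    dev_bufcache(struct vnode *devvp)
    {
        struct specinfo *sip = devvp->v_specinfo;

        /*
         * vnd configuration would have set si_bufcache; every other
         * device keeps using the global cache as before.
         */
        if (sip->si_bufcache != NULL)
            return sip->si_bufcache;
        return &global_bufcache;
    }
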
this is less intrusive code-wise than my previous suggestions, but there's
a new complication in memory management - we either need to dedicate
separate memory to each cache, or have some mechanism for sharing the
memory allowed by the current tuning mechanism among the caches.
another possibility would be to redo the buffer memory-management code a bit.
currently the hidden dependencies are a problem because locking one buffer
can implicitly lock and write arbitrary other buffers (with the calling
sequence getblk -> allocbuf -> buf_trim -> getnewbuf -> bawrite) and writing
those other buffers might require locking a buffer that is already locked
(usually the one on which we are invoking allocbuf()). if we avoided
trying to free up buffer space (or waiting for another thread to do so)
while holding other buffer locks, then there would be no deadlock here.
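
roughly, the rule would be something like this (all of the names here are
invented for the sketch; it's only meant to show where the decision sits):

    extern long bufspace;               /* bytes currently used for buffers */
    extern long bufspace_hiwat;         /* the current tuning limit */
    extern void buf_trim_some(long wanted); /* hypothetical: frees ~wanted bytes */

    static void
    bufspace_grow(int nlocked, long wanted)
    {
        if (bufspace + wanted > bufspace_hiwat && nlocked == 0) {
            /*
             * safe: we hold no buffer locks, so any write the trim
             * triggers can never wait on a lock we already own.
             */
            buf_trim_some(wanted);
        }
        /*
         * if we do hold buffer locks, allocate past the limit for now
         * and leave the trimming to a thread that holds none.
         */
        bufspace += wanted;
    }
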
the problem with this is that it eliminates the existing mechanism for
deterministically limiting the amount of memory used for buffers, so we
would need to do that in a different way. the only way I've thought of to
do this would be to track how many buffers each thread has locked, and then
have a thread free up some buffers at the end of brelse() if it just
unlocked the only buffer it had locked. that doesn't seem very appealing,
but maybe someone else can think of a better way.
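
a sketch of that accounting, with "l_nlockedbufs" as an invented lwp field
and buf_drain_some() standing in for whatever actually frees the space
(bufspace and bufspace_hiwat as in the previous sketch):

    extern void buf_drain_some(long excess); /* hypothetical: frees ~excess bytes */

    void
    brelse(struct buf *bp)
    {
        struct lwp *l = curlwp;

        /* ... the existing brelse() work, which drops the buffer lock ... */

        /*
         * if that was the last buffer this thread had locked, it's now
         * safe to push buffer memory back under the limit: any writes
         * this triggers can no longer wait on a lock we hold.
         */
        if (--l->l_nlockedbufs == 0 && bufspace > bufspace_hiwat)
            buf_drain_some(bufspace - bufspace_hiwat);
    }
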
-Chuck