Subject: Re: yamt-readahead branch
To: None <chuq@chuq.com>
From: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
List: tech-kern
Date: 11/16/2005 12:44:36
> - I don't think that having the read ahead state in struct file
> is the way to go, for a couple reasons:
>
> - it doesn't handle multi-threaded programs, nor groups of processes
> that inherit file descriptors from a common parent.
> - it doesn't handle doing read ahead from page-faults.
>
> I think it would be much better to keep the read ahead state in the
> vnode (or rather the genfs_node) and detect multiple read patterns
> heuristically. in the common case, there's only one sequential pattern
> so detecting that is really easy.
i made it per-file because:
- fadvise() is per-file. (well, it is per-range actually.)
- i think that a file is a good enough approximation of a requester
in common cases.
- uvm_fault() has its own read-ahead mechanism.
and we can easily fall back to per-vnode or per-mapping context
if desirable.
- building a heuristic which can handle multiple streams is hard. :-)
do you have any good ideas?
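
to illustrate what a per-file context can look like, here is a minimal sketch of a single-stream sequential detector. it is not the code in uvm_readahead.c; struct ra_ctx, ra_init(), ra_update() and the window constants are hypothetical names and values chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* assumed limits, for illustration only */
#define RA_MIN_WINDOW	((size_t)64 * 1024)
#define RA_MAX_WINDOW	((size_t)1024 * 1024)

/* hypothetical per-file read-ahead context (one per open file) */
struct ra_ctx {
	off_t next_expected;	/* offset just past the previous request */
	size_t window;		/* current read-ahead size in bytes, 0 = off */
};

static void
ra_init(struct ra_ctx *ra)
{
	assert(ra != NULL);
	ra->next_expected = 0;
	ra->window = 0;
}

/*
 * record a request for [off, off + len) and return how many bytes
 * to read ahead past its end.  a request that starts exactly where
 * the previous one ended counts as sequential and grows the window;
 * anything else resets it.
 */
static size_t
ra_update(struct ra_ctx *ra, off_t off, size_t len)
{
	assert(ra != NULL);
	if (off == ra->next_expected) {
		if (ra->window == 0)
			ra->window = RA_MIN_WINDOW;
		else if (ra->window < RA_MAX_WINDOW)
			ra->window *= 2;
	} else {
		ra->window = 0;
	}
	ra->next_expected = off + (off_t)len;
	return ra->window;
}
```

with one context per file, two descriptors reading the same vnode at different offsets do not disturb each other. two threads sharing one descriptor would keep resetting the window, which is the multiple-stream case raised in the quoted text.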
> - as for the read ahead policy, we should have a sliding-window kind of
> scheme, such that we can keep multiple disk I/Os pending in the disk driver
> all the time (or at least until the application takes a breather).
> ie. we shouldn't wait for the application to copy out all the data that
> we've read ahead before starting more I/Os. (actually on second look,
> I think you already do this, but it's hard to tell since you didn't
> describe your algorithm at all.)
yes, it's my intent.
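
one possible reading of the sliding-window idea, as a sketch only: keep track of how far i/o has already been started, and top it up whenever the reader advances, instead of waiting for the reader to consume everything. this is not a description of the branch's algorithm; struct ra_window and ra_advance() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* hypothetical sliding-window state */
struct ra_window {
	off_t issued_end;	/* i/o has been started for everything below this */
	size_t window;		/* how far ahead of the reader we want to stay */
};

/*
 * called when the reader reaches reader_off.  returns the number of
 * bytes of new i/o to start and stores the starting offset in *startp.
 * returns 0 if enough i/o is already in flight or completed.
 */
static size_t
ra_advance(struct ra_window *w, off_t reader_off, off_t *startp)
{
	off_t target;
	size_t n;

	assert(w != NULL && startp != NULL);

	/* the reader jumped past what we issued; restart from there */
	if (w->issued_end < reader_off)
		w->issued_end = reader_off;

	target = reader_off + (off_t)w->window;
	if (target <= w->issued_end)
		return 0;

	*startp = w->issued_end;
	n = (size_t)(target - w->issued_end);
	w->issued_end = target;
	return n;
}
```

each time the reader moves forward by some amount, the same amount of new i/o is started at the far end of the window, so requests stay pending in the driver as long as the application keeps reading.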
> - there needs to be some feedback for scaling back the amount of read ahead
> that we do in case memory gets low. otherwise we'll have cases where we
> do a bunch of read ahead, but the memory is reclaimed for a different
> purpose before the application catches up.
>
> - I see you have some XXX comments about tuning the amount of data to
> read ahead based on physical memory, which is true, but we should also
> tune based on the I/O throughput of the underlying device. we want to
> be able to keep any device 100% busy, ideally without the user needing
> to configure this manually. but we'll need some way to allow manual
> per-device tuning as well.
sure.
they should be on a todo list, but not for this branch.
sorry for not being clear; finding the best algorithm is not a goal of
this branch. uvm_readahead.c is merely a "sample algorithm".
i think it works well enough for common cases, tho.
or, do you think they can't be done with the current uvm_ra_* api?
> and some comments on the implementation:
sure. i'll change them.
YAMAMOTO Takashi