tech-kern: Re: yamt-readahead branch

Subject: Re: yamt-readahead branch
To: None <chuq@chuq.com>
From: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
List: tech-kern
Date: 11/16/2005 12:44:36

>  - I don't think that having the read ahead state in struct file
>    is the way to go, for a couple reasons:
> 
> 	- it doesn't handle multi-threaded programs, nor groups of processes
> 	  that inherit file descriptors from a common parent.
> 	- it doesn't handle doing read ahead from page-faults.
> 
>    I think it would be much better to keep the read ahead state in the
>    vnode (or rather the genfs_node) and detect multiple read patterns
>    heuristically.  in the common case, there's only one sequential pattern
>    so detecting that is really easy.

i made it per-file because:

	- fadvise() is per-file.  (well, it is per-range actually.)

	- i think that a file is a good enough approximation of a requester
	  in common cases.

	- uvm_fault() has its own read-ahead mechanism.
	  and we can easily fall back to per-vnode or per-mapping context
	  if desirable.

	- building a heuristic which can handle multiple streams is hard. :-)
	  do you have any good ideas?
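
to illustrate the per-file argument, here is a rough sketch of a
sequential-read detector kept in a per-open-file context.  it is not the
code on the branch: struct ra_ctx, ra_note_read() and the window constants
are names and values made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-open-file read-ahead context; not the branch's struct. */
struct ra_ctx {
	uint64_t next_off;	/* offset where a sequential reader would continue */
	size_t	 win;		/* current read-ahead window in bytes, 0 = off */
};

#define RA_WIN_MIN	(16 * 1024)	/* assumed values, for illustration only */
#define RA_WIN_MAX	(256 * 1024)

static void
ra_init(struct ra_ctx *ra)
{
	ra->next_off = 0;
	ra->win = 0;
}

/*
 * Record a read of [off, off + len) and return how many bytes of
 * read-ahead the caller might want to start beyond the request.
 */
static size_t
ra_note_read(struct ra_ctx *ra, uint64_t off, size_t len)
{
	assert(len > 0);
	if (off == ra->next_off) {
		/* sequential: open the window, then double it up to the cap */
		if (ra->win == 0)
			ra->win = RA_WIN_MIN;
		else if (ra->win < RA_WIN_MAX)
			ra->win *= 2;
	} else {
		/* not sequential: stop reading ahead */
		ra->win = 0;
	}
	ra->next_off = off + len;
	return ra->win;
}
```

with one context per open file, two descriptors reading the same vnode at
different offsets each see their own sequential stream.  a single context
of this kind shared per vnode would see the offsets interleaved and keep
resetting its window, which is the multiple-stream problem mentioned above.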

>  - as for the read ahead policy, we should have a sliding-window kind of
>    scheme, such that we can keep multiple disk I/Os pending in the disk driver
>    all the time (or at least until the application takes a breather).
>    ie. we shouldn't wait for the application to copy out all the data that
>    we've read ahead before starting more I/Os.  (actually on second look,
>    I think you already do this, but it's hard to tell since you didn't
>    describe your algorithm at all.)

yes, that's my intent.
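
the sliding-window idea in the quoted paragraph could look roughly like
the following.  this is only a sketch under assumptions, not what
uvm_readahead.c does: struct ra_window and ra_slide() are invented names,
and actually starting the I/O is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sliding-window state; names are invented for this sketch. */
struct ra_window {
	uint64_t issued_end;	/* end offset of read-ahead already started */
	size_t	 win;		/* how far ahead of the reader to stay */
	size_t	 chunk;		/* smallest I/O worth starting */
};

/*
 * The reader has reached app_off.  Return the number of bytes of new
 * read-ahead to start at *startp, or 0 if enough is already pending.
 */
static size_t
ra_slide(struct ra_window *w, uint64_t app_off, uint64_t *startp)
{
	uint64_t want_end = app_off + w->win;
	size_t n;

	assert(w->chunk > 0 && w->chunk <= w->win);
	if (w->issued_end < app_off)
		w->issued_end = app_off;	/* reader outran us; restart here */
	if (w->issued_end > want_end ||
	    want_end - w->issued_end < w->chunk)
		return 0;			/* window is still nearly full */
	n = (size_t)(want_end - w->issued_end);
	*startp = w->issued_end;
	w->issued_end = want_end;
	return n;
}
```

the point is that new I/O is started as soon as the reader has consumed
one chunk, rather than after it has consumed the whole window, so some
requests stay pending in the driver while the application copies data out.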

>  - there needs to be some feedback for scaling back the amount of read ahead
>    that we do in case memory gets low.  otherwise we'll have cases where we
>    do a bunch of read ahead, but the memory is reclaimed for a different
>    purpose before the application catches up.
> 
>  - I see you have some XXX comments about tuning the amount of data to
>    read ahead based on physical memory, which is true, but we should also
>    tune based on the I/O throughput of the underlying device.  we want to
>    be able to keep any device 100% busy, ideally without the user needing
>    to configure this manually.  but we'll need some way to allow manual
>    per-device tuning as well.

sure.
they should be on a todo list, but not for this branch.
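
for the memory feedback part only, something along these lines might be
enough as a starting point.  it is a hypothetical helper, not part of the
branch; ra_scale_window() and its thresholds are arbitrary assumptions, and
it says nothing about tuning for device throughput.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: shrink a read-ahead window when free memory is low.
 * The thresholds (1/8 and 1/32 of total pages) are arbitrary assumptions.
 */
static size_t
ra_scale_window(size_t win, size_t free_pages, size_t total_pages)
{
	assert(total_pages > 0 && free_pages <= total_pages);
	if (free_pages < total_pages / 32)
		return 0;		/* very low: don't read ahead at all */
	if (free_pages < total_pages / 8)
		return win / 2;		/* getting low: halve the window */
	return win;			/* plenty of memory: leave it alone */
}
```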

sorry for not being clear: finding the best algorithm is not a goal of
this branch.  uvm_readahead.c is merely a "sample algorithm".
i think it works well enough for common cases, tho.

or, do you think they can't be done with the current uvm_ra_* api?

> and some comments on the implementation:

sure.  i'll change them.

YAMAMOTO Takashi