Subject: Re: Disk scheduling policy (Re: NEW_BUFQ_STRATEGY)
To: Alfred Perlstein <bright@mu.org>
From: Robert Elz <kre@munnari.OZ.AU>
List: tech-kern
Date: 12/02/2003 18:16:02
    Date:        Mon, 1 Dec 2003 15:29:37 -0800
    From:        Alfred Perlstein <bright@mu.org>
    Message-ID:  <20031201232937.GJ75620@elvis.mu.org>

  | That can lead to starvation.  The best bet is to keep it simple and
  | make sure that you sweep back and forth, never doubling back.

Actually, I suspect that's the theoretical best bet, and not the
unix filesystem best bet.

There was a lot of work on i/o scheduling for unix back in the 70's
(back in the days when drives were slow and buffer caches minuscule, so
waiting for i/o was just about all a busy system would ever do).

It turns out (from experiments then, which may no longer be valid - those
were the days when "read block N" meant exactly that) that reading backwards
through the drive gives sub-optimal results.  The best bet is usually to
sweep from 0 .. N, then return to 0 and start all over again.

The reason is obvious once it has been explained ... the vast majority of
I/O is sequential, and the filesystem does read-ahead, but returning the
read-ahead blocks before the block the process actually wanted is a waste of
time - nothing wants them yet.   Worse than that, it can actually cause a
block read ahead to be flushed (due to non-use) before the process gets a
chance to access it.
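
To make the policy concrete, here is a minimal sketch in C of a
one-directional ("circular") sweep over a request queue kept sorted by
block number.  The structure and function names are invented for
illustration only - this is not the actual bufq code:

#include <stddef.h>

struct req {
	unsigned long	blkno;		/* first block the request touches */
	struct req	*next;		/* queue kept sorted, ascending */
};

static struct req	*queue;		/* pending requests, sorted by blkno */
static unsigned long	headpos;	/* where the last request finished */

/*
 * Pick the next request to issue: the first one at or beyond the
 * current head position; if nothing is left out there, wrap back
 * and restart the sweep from the lowest-numbered request, rather
 * than turning around and sweeping downwards.
 */
static struct req *
next_request(void)
{
	struct req *r;

	for (r = queue; r != NULL; r = r->next)
		if (r->blkno >= headpos)
			return r;
	return queue;			/* wrap: sweep from 0 again */
}

The whole point is simply that the head position only ever moves upwards,
so a demanded block is never delayed behind read-ahead sitting above it
on a downward pass.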

Filesystems of the time (unlike current ones) did not necessarily allocate
blocks of a file in increasing order through the drive, so this may seem
strange - but another system change was to cause that to happen: actually
keep the free list (roughly) sorted, so blocks allocated to a sequentially
written file (most of them) would go to the disc in mostly increasing block
number order.   FFS (and, I believe, LFS) does that now - precisely to
allow better read algorithms.
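
As a sketch of the idea (my own illustration, not the historical code):
if the free list is kept sorted, allocation always hands out the
lowest-numbered free block, so consecutive allocations for a sequentially
written file come out in mostly increasing block number order:

#include <stdlib.h>

struct freeblk {
	unsigned long	blkno;
	struct freeblk	*next;		/* list kept sorted, ascending */
};

static struct freeblk *freelist;

/* allocate: take the lowest-numbered free block */
static unsigned long
blk_alloc(void)
{
	struct freeblk *b = freelist;
	unsigned long n;

	if (b == NULL)
		return 0;		/* no space (0 treated as "none" here) */
	freelist = b->next;
	n = b->blkno;
	free(b);
	return n;
}

/* free: insert back in sorted position so the list stays ordered */
static void
blk_free(unsigned long n)
{
	struct freeblk **bp, *b;

	if ((b = malloc(sizeof(*b))) == NULL)
		return;
	b->blkno = n;
	for (bp = &freelist; *bp != NULL && (*bp)->blkno < n; bp = &(*bp)->next)
		;
	b->next = *bp;
	*bp = b;
}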

In any case, be wary of assuming that theoretical algorithms, which mostly
assume random arrival of requests (with some distribution or other), will
apply to a unix system - the blocks requested are not random, or anything
like it.

And speaking of I/O and readahead ... another change with dramatic
performance benefits was individual per-file readahead tuning - that
is, increasing the readahead amount for a file (for a particular
instance of reading that file) until i/o requests never block (or
some upper limit is reached).  As long as i/o remains sequential, and
i/o keeps blocking waiting for the data to appear, the readahead is
increased.  (It was a bit more than that, but you get the idea.)
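
Something along these lines - a rough sketch only; the field names, the
doubling growth rule and the cap are my own guesses at the shape of it,
not the actual algorithm:

#include <sys/types.h>

struct ra_state {
	off_t	ra_nextoff;	/* offset we expect the next read at */
	int	ra_blocks;	/* current readahead amount, in blocks */
};

#define	RA_MAX	64		/* arbitrary upper limit */

/*
 * Called on each read; "blocked" says whether the read had to wait
 * for the data to arrive.  The caller advances ra_nextoff past the
 * data just read after this returns.
 */
static void
ra_update(struct ra_state *ra, off_t off, int blocked)
{
	if (off != ra->ra_nextoff) {
		/* access went non-sequential: start over */
		ra->ra_blocks = 1;
	} else if (blocked && ra->ra_blocks < RA_MAX) {
		/* still waiting on the disc: read further ahead */
		ra->ra_blocks *= 2;
		if (ra->ra_blocks > RA_MAX)
			ra->ra_blocks = RA_MAX;
	}
	/* otherwise readahead is keeping up; leave it alone */
}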

For completeness, much of this work was done by Tim Long at the University
of Sydney (along with Piers Lauder and Chris Maltby).

kre