Subject: Re: Extension of fsync_range() to permit forcing disk cache flushing
To: J Chapman Flack <flack@cs.purdue.edu>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 12/16/2004 13:56:02
On Thu, Dec 16, 2004 at 03:40:19PM -0500, J Chapman Flack wrote:
> Bill Studenmund writes:
> > I have an application that wants to be able to know that certain writes
> > have been forced to permanent storage - that they aren't still sitting in
> > the disk's write cache. This idea is similar to the current thread about
> >...
> > After discussing this with some developers, the best solution seems to be
> > to add a flag to fsync_range() to force this behavior. Then pass a flag
>
> What would be the performance hit in making this the /default/
> behavior of fsync and fsync_range? I ask only because I strongly
My understanding is it would be severe. This idea was in fact my first
thought. But I received a strong negative reaction, and so I came up with
this proposal.
Note: I do not have strong feelings on this; however, I really do not want
to get caught between different camps with strong opinions I can't really
argue. :-)
From what I was told, adding cache flushing to fsync would cut the mail
transfer rate on a mail server by more than 50%. It's been tried and
it stank. Thus I think I'd be handed my scalp if I tried to force fsync to
cache-flush.
> suspect that many programmers who have used these functions in their code
> have done so on the belief that they do this already - a quite reasonable
> belief when the man page says "causes all modified data ... to be written
> to permanent storage." It seems more conservative to make sure that code
Note: I think you missed a word in there, and that word is important.
Either that or we updated the man page. My man page says "permanent
storage device." fsync and fsync_range DO ensure the data made it to the
disk device.
The question is what do we let the device do with it.
> programmers have written expecting it to be safe actually is, than to add
> a flag they can edit back into their code if they really meant to say
> what they thought they were already saying. The inverse flag - "send this
> as far as the disk buffer but don't sweat the cache flush" - could then be
> used with deliberation by those who really understand the implications
> and are willing to accept them (though I'm not completely sure what the
> point of such an operation would be - it takes longer than doing nothing,
> but still gives you no assurance the bits are on disk).
The reason for doing things the way I propose is that getting things to
the disk usually is good enough. If you're writing a lot of data, you'll
probably flush the cache out very quickly. Also, if you really care about
the disk cache you can:
1) turn the write-back caches off
2) put your system on a UPS, start a shutdown when house power drops,
and sync the disk caches on the way down.
If instead we force fsync() and fsync_range() to always flush the cache,
we prevent admins from being able to decide what is safe enough.
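To make the opt-in idea concrete, here is a rough sketch of how an
application that cares might ask for the cache flush while everyone else
keeps calling plain fsync(). It is only an illustration under assumptions:
the flag names FFILESYNC and FDISKSYNC are assumed spellings that this
message does not define, and sync_range_to_media() and append_durably() are
hypothetical helpers. Where the flags are not available, the sketch falls
back to fsync(), which gets the data to the disk device but leaves the
write cache up to the drive and the admin.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the proposal): sync a byte range of fd and,
 * where the platform has an opt-in flag for it, also ask the disk to flush
 * its write cache.  Returns 0 on success, -1 on error with errno set.
 */
static int
sync_range_to_media(int fd, off_t start, off_t length)
{
#if defined(FFILESYNC) && defined(FDISKSYNC)
	/* Assumed flag names; the caller opts in to the cache flush. */
	return fsync_range(fd, FFILESYNC | FDISKSYNC, start, length);
#else
	/* No such flag here: plain fsync(), so the drive's cache policy decides. */
	(void)start;
	(void)length;
	return fsync(fd);
#endif
}

/*
 * Hypothetical caller: append one record and do not report success until
 * the sync call has returned.
 */
static int
append_durably(const char *path, const void *buf, size_t len)
{
	int fd, rv = -1;
	off_t start;

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd == -1)
		return -1;
	start = lseek(fd, 0, SEEK_END);	/* offset where the new record begins */
	if (start != (off_t)-1 &&
	    write(fd, buf, len) == (ssize_t)len &&
	    sync_range_to_media(fd, start, (off_t)len) == 0)
		rv = 0;
	if (close(fd) == -1)
		rv = -1;
	return rv;
}
```

Only the code path that needs the stronger guarantee pays for it; a mail
server that is happy with "made it to the disk device" never calls the
helper and keeps its current fsync() cost.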
Also, you're implicitly assuming that disks don't fail. If an admin has
taken steps, either in configuration or product choice (good UPS or RAID
box w/ battery backup for cache), to ensure that writing to the disk drive
is as good as writing to the media, why penalize him or her by forcing
cache flushing? And since disks fail (even in RAID), all we have to do is
make sure that the cache failure probability is less than the disk drive
failure probability, and then the cache doesn't matter.
Other thoughts?
The only other comment I've gotten is that I used "long" in places where I
should use "int".
Take care,
Bill