Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-bugs
Date: 07/06/2005 15:20:02
The following reply was made to PR kern/30674; it has been noted by GNATS.
From: Greg Oster <oster@cs.usask.ca>
To: gnats-bugs@netbsd.org, Matthias Scheler <tron@colwyn.zhadum.de>
Cc:
Subject: Re: kern/30674: RAIDframe should be able to create volumes without parity rewrite
Date: Wed, 06 Jul 2005 09:19:19 -0600
Matthias Scheler writes:
> >Number: 30674
> >Category: kern
> >Synopsis: RAIDframe should be able to create volumes without parity rewrite
> >Confidential: no
> >Severity: non-critical
> >Priority: medium
> >Responsible: kern-bug-people
> >State: open
> >Class: change-request
> >Submitter-Id: net
> >Arrival-Date: Wed Jul 06 09:52:00 +0000 2005
> >Originator: Matthias Scheler
> >Release: NetBSD 3.99.7
> >Organization:
> Matthias Scheler http://scheler.de/~matthias/
> >Environment:
> System: NetBSD lyssa.zhadum.de 3.99.7 NetBSD 3.99.7 (LYSSA) #0: Mon Jul 4 10:16:28 BST 2005 tron@lyssa.zhadum.de:/src/sys/compile/LYSSA i386
> Architecture: i386
> Machine: i386
> >Description:
> Setting up a RAIDframe volume requires an initial parity rewrite which
> can take a long time. This is completely pointless because the volume
> doesn't contain any data yet.
Let's address the RAID 1 case first:
If you're just going to build an FFS on it, then you can get away with
marking the parity as "good", because data will never be read until
after it has been written. Fine. But if the machine crashes or
otherwise goes down without the parity being marked as "good", then you
are back to square one -- you *HAVE* to do the parity rebuild at that
point, since you have no guarantee that there were no writes in
progress, or that, for any given sector, the primary and the mirror
are in sync. So the only thing you've saved is the initial rebuild
(and there's nothing stopping you from doing that initial rebuild in
the background after you've started using the partition).
There is, however, also a violation of the Principle of Least Astonishment.
If, for example, the components had random data on them before the
RAID 1 set was created, and one runs "dd if=/dev/rraid0d | md5"
twice with the parity marked as "good" (but not actually synced!), the
two runs may well yield different results. One certainly does not expect a
"disk device" to return different data on subsequent reads! (RAIDframe
will pick either the master or the mirror to read from -- for sectors
where data has already been written, this isn't a problem. For sectors
where data has not been written, but where we are still claiming that
the parity is good, it violates the PoLA.)
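Both situations -- sectors written after the set was created versus sectors never written at all -- can be sketched with a toy mirror model. This is illustrative Python, not RAIDframe code; the names and the alternating read policy are made up (RAIDframe's actual read selection differs), but the point stands under any policy that may read from either component:

```python
# Toy RAID 1 set whose parity was marked "good" without an actual sync:
# the two components still hold unrelated leftover data.
primary = {0: b"garbage-from-disk-1", 1: b"garbage-from-disk-1"}
mirror  = {0: b"garbage-from-disk-2", 1: b"garbage-from-disk-2"}

reads = 0
def read_sector(n):
    # A mirror read may be serviced by either component; alternate here.
    global reads
    reads += 1
    return primary[n] if reads % 2 else mirror[n]

def write_sector(n, data):
    # Writes go to both components, bringing that sector into sync.
    primary[n] = data
    mirror[n] = data

# Sector 0 is written before it is ever read: reads agree (the FFS case).
write_sector(0, b"filesystem-data")
print(read_sector(0) == read_sector(0))   # True

# Sector 1 was never written: two reads of the same sector can differ.
print(read_sector(1) == read_sector(1))   # False
```

The second result is exactly the back-to-back "dd | md5" surprise above: the same "disk device" returns two different answers for the same sector.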
Let's now look at the RAID 5 case: Consider a stripe made up of
component blocks A, B, C, D, and E. Let A be the block being updated,
and E be the parity for the stripe. Let E *not* be the XOR of A, B, C,
and D, which will be the case if the parity rewrite is not done.
To do a write of A, the old contents of A will be read, the current
contents of E will be read, a new E will be computed (the old E XOR
the old A XOR the new A), and the new A and new E will be written.
In the event that A's component fails, there is now no way of
reconstructing the contents of A, since B, C, and D were never in
sync with E, and thus are useless in recomputing A. For RAID 5, one
*MUST* rebuild the parity before live data is put on the RAID set, as
otherwise there will be no way of reconstructing data in the event of
a component failure.
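That failure mode is plain XOR arithmetic, and can be sketched in a few lines of Python. The block values and names here are made up for illustration; 32-bit ints stand in for sector contents (real parity works the same way bytewise):

```python
# Toy model of one RAID 5 stripe: data blocks A, B, C, D and parity E.
A, B, C, D = 0xDEADBEEF, 0x01234567, 0x89ABCDEF, 0x0F0F0F0F

E_good  = A ^ B ^ C ^ D   # parity after a proper initial rewrite
E_stale = 0x13371337      # leftover garbage: rewrite never done

# Read-modify-write of A, exactly as described above: read the old A,
# read E, compute the new E, write the new A and the new E.
A_new   = 0xCAFEBABE
E_good  ^= A ^ A_new
E_stale ^= A ^ A_new
A = A_new

# The component holding A now fails; reconstruct A from the survivors.
A_from_good  = B ^ C ^ D ^ E_good    # recovers A
A_from_stale = B ^ C ^ D ^ E_stale   # silently wrong data

print(A_from_good == A)    # True
print(A_from_stale == A)   # False: B, C, D were never in sync with E
```

With correct initial parity the update rule preserves the invariant E = A ^ B ^ C ^ D, so reconstruction works; with stale parity the update rule preserves the *garbage*, and the rebuilt "A" is wrong with no error reported anywhere.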
I've heard this argument a couple of times now, but I don't see it
buying anything other than skipping one parity rebuild...
Further comments? As you can guess, I'm not seeing any real advantage to
creating volumes without parity rewrites, even for RAID 1 sets.
Later...
Greg Oster