Subject: Re: RAID, ccd, and vinum.
To: Richard Rauch <rkr@olib.org>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-help
Date: 12/20/2004 21:02:27
Richard Rauch writes:
> (Sorry in advance if this is a little scattered...)
> 
> On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> > Richard Rauch writes:
[snip]
> > > Other than changing the fstype to RAID (from ccd), I now have it labeled
> > > the same way.  raidctl does not raise any warnings, nor do I see warnings
> > > during newfs or mount.  So it appears that raidctl is "cleaner" with the
> > > same disklabel.
> > > 
> > > Because I get no complaints under RAID, and thought that I'd double-checked
> > > my arithmetic (even triple-checked), I think that my partitions are okay.
> > > But maybe they aren't, and the wording for the warning is simply phrased
> > > very poorly.
> > > 
> > > So, question #1:
> > > 
> > >   Should I worry about those warnings with ccd---or their absence in RAID?
> > 
> > I don't know what warnings you were getting from CCD, so it's hard to answer 
> > that :)
> 
> I figured that there weren't too many candidate messages...  (^&

But you forgot that some of us are too lazy to look them up ;)
 
> > > 2) I would have thought that a RAID 0 and a ccd, with the same
> > > stripe size on the same partitions of the same disks, would perform
> > > very nearly identically.  Yet with ccd and a stripe size of about
> > > 1000,
> > 
> > This sounds... "high".  How are you managing to feed it 1000 blocks 
> 
> I think that these suggested numbers came from the vinum docs.  I thought
> that ccd also suggested those numbers, but in retrospect, I can't find
> support for that.  I might have been thinking in terms of "cylinders",
> though, and using the disklabel cylinder size.  As I said elsewhere,
> the "usual" 63-block reserved spot at the beginning of the disk was
> not enough to make ccdconfig happy, so I went up to a cylinder-size as
> disklabel reports it.  That may have been part of the impetus for
> the larger stripes.
> 
> 
> > of whatever so that it stripes across all disks for a single IO (for 
> > hopefully optimal performance)? :)
> 
> How the disk buffers sort themselves out, I don't know.  I expect that most of
> the space on this system will be used by files starting around 100MB in size,
> on up.  So even though bonnie++ *is* just a benchmark, it may not be a bad
> one, with its default of 300MB for filesize.  (^&  I assume that bonnie++ is
> using the usual stdio features of fputc() and fwrite().

300MB is typically a bit small these days... sizeof(RAM) is usually a 
bit better.
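
(For example -- assuming 512MB of RAM in the box and the array mounted
on /mnt, both of which are just placeholders for your setup:

   bonnie++ -d /mnt -s 1024 -u root

-s is the file size in MB, -d is where to put the test files, and -u
is only needed if you run it as root.)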

It's just that with restrictions like MAXPHYS, RAIDframe (and CCD) never 
gets presented with more than 64K of data at a time...  So for very 
large stripe sizes, you're probably only touching one disk for a 
given IO.
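
To put rough numbers on that (assuming 512-byte sectors):

   64KB per IO / 512 bytes per sector = 128 sectors per IO

   128 sectors into a 1024-sector stripe unit -> usually all on one disk
   128 sectors into a   64-sector stripe unit -> spread across both disks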
 
> Here is one result, with ccd:
> 
> ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
[snip] 
> Here's a run with a somewhat larger stripe size:
> 
> ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
[snip]

1 cable means two (both?) drives on the same cable?

> Most of the variation is within the range of what you'd get for a single run.
> But the Sequential Input/Block and Sequential Output/Rewrite numbers
> both seem very positively affected by this.

Except I think you'd be stuffing 1024 blocks onto one disk before 
you'd start stuffing blocks onto the next, and if you only hand the 
ccd 128 blocks at a time, that means you're only (usually) ever 
touching the one disk.
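
(If you want to experiment with smaller interleaves, a minimal
/etc/ccd.conf along these lines is all it takes -- the wd1e/wd2e names
are just placeholders for whatever your components really are:

   # ccd           ileave  flags   component devices
   ccd0            32      none    /dev/wd1e /dev/wd2e

and then "ccdconfig -C" configures everything in the file, and
"ccdconfig -U" tears it all down again.)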
 
> 
> Finally, for giggles, the 1024-sized-stripe ccd0, with each drive on its own
> cable (which is how I intended to ultimately set the drives, but I wanted
> to see how their performance would be affected by sharing a cable):
> 
> ccd (ffs, softdep; 2 cables; 1024  interleave)
[snip]
> ...big spike in seek performance; possibly significant drop in some
> other areas (Sequential Input/Block down 5.6MB/sec, Sequential
> deletion down 3000 files/sec.)

For further giggles, try the benchmarks using just a single disk... 

> > > I was getting (according to bonnie++) a solid 250 (248 to
> > > 268 range, I think) seeks per second.  With the same stripe size
> > > on a RAID 0, I was getting 200 or so (in the best config, upper
> > > 190 to 220 range; others more routinely around 160).  With RAID 0,
> > > there is essentially no overhead for computing parity.
> > 
> > But there is a lot more overhead for other stuff... however: with a 
> 
> Even RAID 0? 

Yes.  (more below)

> I assumed that the extra stuff was skipped around
> pretty quickly when there's no parity to compute.
> 
> For the most part, RAID 0 and ccd perform very similarly for me.
> Other than the issue of seeks, the main differences that I see are:
> RAID can auto config, and ccd emits extra warnings that have me
> worried.  (Notice that these aren't performance differences.
> (^&)
> 
> 
> > "stripe size" of 1000 (not sure if that's total sectors per entire 
> > stripe, or per stripe width, or what :-} ) I'm guessing that this 
> > RAID set isn't performing anywhere close to optimal. :)
> 
> Probably not.  But as long as I am close to the limit of what I
> can expect an NFS server to deliver over 100 Mbs ethernet, I
> won't worry too much about whether it might be optimal for the
> occasional local access.
> 
> 
> > > So, question #2:
> > > 
> > >   Why is there such a disparity (ahem) between the two benchmarks?
> > 
> > RAIDframe has way more overhead when one is (effectively) writing to 
> > just a single disk?
> 
> Well, yes.  Let me put it another way:
> 
> What is some of the overhead that makes RAID 0 perform significantly
> slower (to my estimation) at seeking in these tests?

Even for RAID 0, RAIDframe constructs a directed acyclic graph to 
describe the IO operation.  It then traverses this graph from single 
source to single sink, "firing" nodes along the way.  And while this 
provides a very general way of describing disk IO, all that graph 
creation, traversing, and teardown does take some time.

For 2 disks in a RAID 0 config, try a stripe width of 64.  If the 
filesystem is going to have large files on it, a block/frag setting 
of 65536/8192 might yield quite good performance.
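
Something along these lines should do it -- the wd1e/wd2e components,
the serial number, and the 'a' partition are placeholders for your
setup, and I'm reading "stripe width of 64" as 64 sectors per stripe
unit (SectPerSU):

   # raid0.conf
   START array
   # numRow numCol numSpare
   1 2 0

   START disks
   /dev/wd1e
   /dev/wd2e

   START layout
   # SectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   64 1 1 0

   START queue
   fifo 100

   raidctl -C raid0.conf raid0
   raidctl -I 2004122001 raid0     # component label serial number
   raidctl -A yes raid0            # turn on autoconfiguration
   (disklabel raid0, then)
   newfs -b 65536 -f 8192 /dev/rraid0a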

[snip]

Later...

Greg Oster