Subject: Re: RAID, ccd, and vinum.
To: Richard Rauch <rkr@olib.org>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-help
Date: 12/20/2004 21:02:27
Richard Rauch writes:
> (Sorry in advance if this is a little scattered...)
>
> On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> > Richard Rauch writes:
[snip]
> > > Other than changing the fstype to RAID (from ccd), I now have it labeled
> > > the same way. raidctl does not raise any warnings, nor do I see warnings
> > > during newfs or mount. So it appears that raidctl is "cleaner" with the
> > > same disklabel.
> > >
> > > Because I get no complaints under RAID, and thought that I'd double-checked
> > > my arithmetic (even triple-checked), I think that my partitions are okay.
> > > But maybe they aren't, and the wording for the warning is simply phrased
> > > very poorly.
> > >
> > > So, question #1:
> > >
> > > Should I worry about those warnings with ccd---or their absence in RAID?
> >
> > I don't know what warnings you were getting from CCD, so it's hard to answer
> > that :)
>
> I figured that there weren't too many candidate messages... (^&
But forgot that some of us are too lazy to look them up ;)
> > > 2) I would have thought that a RAID 0 and a ccd, with the same
> > > stripe size on the same partitions of the same disks, would perform
> > > very nearly identically. Yet with ccd and a stripe size of about
> > > 1000,
> >
> > This sounds... "high". How are you managing to feed it 1000 blocks
>
> I think that these suggested numbers came from the vinum docs. I thought
> that ccd also suggested those numbers, but in retrospect, I can't find
> support for that. I might have been thinking in terms of "cylinders",
> though, and using the disklabel cylinder size. As I said elsewhere,
> the "usual" 63-block reserved spot at the beginning of the disk was
> not enough to make ccdconfig happy, so I went up to a cylinder-size as
> disklabel reports it. That may have been part of the impetus for
> the larger stripes.
>
>
> > of whatever so that it stripes across all disks for a single IO (for
> > hopefully optimal performance)? :)
>
> How the disk buffers sort themselves out, I don't know. I expect that most of
> the space on this system will be used by files starting around 100MB in size,
> on up. So even though bonnie++ *is* just a benchmark, it may not be a bad
> one, with its default of 300MB for filesize. (^& I assume that bonnie++ is
> using the usual stdio features of fputc() and fwrite().
300MB is typically a bit small these days... sizeof(RAM) is usually a
bit better.
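Something along these lines (untested; the mount point is only a
placeholder, and -s is in MB, so set it to roughly your RAM size):

  # e.g. for a box with 512MB of RAM, array mounted on /mnt/raid
  bonnie++ -d /mnt/raid -s 512
  # (add "-u <some non-root user>" if you run it as root)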
It's just that with restrictions like MAXPHYS, RAIDframe (and CCD) never
gets presented with more than 64K of data at a time... So for very
large stripe sizes, you're probably only touching one disk for a
given IO.
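To put rough numbers on it (assuming 512-byte sectors and the usual
64K MAXPHYS):

  64KB per transfer / 512 bytes per sector = 128 sectors per transfer
  interleave 1024:  128 sectors fit inside one stripe unit, so each
                    transfer goes to a single disk
  interleave   64:  128 sectors cover two stripe units, so each
                    transfer is split across both disks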
> Here is one result, with ccd:
>
> ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
[snip]
> Here's a run with a somewhat larger stripe size:
>
> ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
[snip]
1 cable means two (both?) drives on the same cable?
> Most of the variation is within the range of what you'd get for a single
> run. But the Sequential Input/Block and Sequential Output/Rewrite numbers
> both seem very positively affected by this.
Except I think you'd be stuffing 1024 blocks onto one disk before
you'd start stuffing blocks onto the next, and if you only hand the
ccd 128 blocks at a time, that means you're only (usually) ever
touching the one disk.
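i.e. to get one 128-block transfer split across both spindles, the
interleave needs to be 64 or less.  Roughly (device names are only
examples -- use whatever partitions you actually gave to ccd):

  # /etc/ccd.conf
  # ccd   ileave  flags   component devices
  ccd0    64      none    /dev/wd1e /dev/wd2e

and then "ccdconfig -C" to configure it from that file.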
>
> Finally, for giggles, the 1024-sized-stripe ccd0, with each drive on its own
> cable (which is how I intended to ultimately set the drives, but I wanted
> to see how their performance would be affected by sharing a cable):
>
> ccd (ffs, softdep; 2 cables; 1024 interleave)
[snip]
> ...big spike in seek performance; possibly significant drop in some
> other areas (Sequential Input/Block down 5.6MB/sec, Sequential
> deletion down 3000 files/sec.)
For further giggles, try the benchmarks using just a single disk...
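e.g. newfs one of the component partitions on its own and run the same
bonnie++ against that (only a sketch, and it will of course clobber
whatever is on that partition):

  newfs /dev/rwd1e
  mount /dev/wd1e /mnt/single
  bonnie++ -d /mnt/single -s 512

That gives you a single-spindle baseline to compare the ccd and RAID
numbers against.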
> > > I was getting (according to bonnie++) a solid 250 (248 to
> > > 268 range, I think) seeks per second. With the same stripe size
> > > on a RAID 0, I was getting 200 or so (in the best config, upper
> > > 190 to 220 range; others more routinely around 160). With RAID 0,
> > > there is essentially no overhead for computing parity.
> >
> > But there is a lot more overhead for other stuff... however: with a
>
> Even RAID 0?
Yes. (more below)
> I assumed that the extra stuff was skipped around
> pretty quickly when there's no parity to compute.
>
> For the most part, RAID 0 and ccd perform very similarly for me.
> Other than the issue of seeks, the main differences that I see are:
> RAID can auto config, and ccd emits extra warnings that have me
> worried. (Notice that these aren't performance differences.
> (^&)
>
>
> > "stripe size" of 1000 (not sure if that's total sectors per entire
> > stripe, or per stripe width, or what :-} ) I'm guessing that this
> > RAID set isn't performing anywhere close to optimal. :)
>
> Probably not. But as long as I am close to the limit of what I
> can expect an NFS server to deliver over 100 Mbps Ethernet, I
> won't worry too much about whether it might be optimal for the
> occasional local access.
>
>
> > > So, question #2:
> > >
> > > Why is there such a disparity (ahem) between the two benchmarks?
> >
> > RAIDframe has way more overhead when one is (effectively) writing to
> > just a single disk?
>
> Well, yes. Let me put it another way:
>
> What is some of the overhead that makes RAID 0 perform significantly
> slower (to my estimation) at seeking in these tests?
Even for RAID 0, RAIDframe constructs a directed acyclic graph to
describe the IO operation. It then traverses this graph from single
source to single sink, "firing" nodes along the way. And while this
provides a very general way of describing disk IO, all that graph
creation, traversal, and teardown does take some time.
For 2 disks in a RAID 0 config, try a stripe width of 64. If the
filesystem is going to have large files on it, a block/frag setting
of 65536/8192 might yield quite good performance.
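For concreteness, something like the following (untested, and wd1e/wd2e
are only placeholders for whatever component partitions you're using):

  # /etc/raid0.conf
  START array
  # numRow numCol numSpare
  1 2 0

  START disks
  /dev/wd1e
  /dev/wd2e

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  64 1 1 0

  START queue
  fifo 100

and then:

  raidctl -C /etc/raid0.conf raid0
  raidctl -I 2004122001 raid0      # any serial number will do
  raidctl -A yes raid0             # autoconfigure at boot
  disklabel -e -I raid0            # add e.g. raid0e as a 4.2BSD partition
  newfs -b 65536 -f 8192 /dev/rraid0e

(No parity to initialize for RAID 0, so no "raidctl -i" pass is needed.)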
[snip]
Later...
Greg Oster