Subject: Re: RAID, ccd, and vinum.
To: Greg Oster <oster@cs.usask.ca>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 12/20/2004 18:32:21
(Sorry in advance if this is a little scattered...)
On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> Richard Rauch writes:
> > I've been playing around with RAID and ccd---and made a pass at vinum,
> > but vinum required an unspecified kernel option, it seems. (I had used
> > a kernel with the one vinum option in GENERIC enabled, but still got
> > errors with /dev/vinum/control or whatever (an extant node) not being
> > configured. Unless I somehow rebooted with the wrong kernel, I figured
> > that something in that under-cooked a condition was not a pressing need.
> > (^&)
> >
> > ccd and RAID both look like they are serviceable. I did have two questions
> > about them:
> >
> >
> > 1) With ccd, I *always* got kernel messages. At first, I only left 63
> > blocks at the front for the disklabel, as "real" disks use. Then
> > after re-reading the documentation and googling around, I decided to try
> > using the more or less fictional cylinder size reported by disklabel
> > (1008). Either I couldn't do arithmetic (possible) or that was still
> > not enough, so I bumped it up to a larger, slightly rounder number.
> > Now, ccdconfig and friends mostly no longer complain, except for one
> > curious warning: it now seems to be telling me that my disklabel
> > is not using the entire disk. (Well, duh. It threw a hissy fit when I
> > used the whole disk! (^& Besides, as someone in a previous mailing list
> > message observed, one may want to use a portion of a disk in a ccd, and
> > mount other parts as more conventional filesystems.)
>
> What kernel messages were you seeing from ccd?
The complaint that appears to object to my not using the whole disk is:
WARNING: ccd0: total sector size in disklabel (468879360) != the size of ccd (468880000).
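If I'm reading that right, the label I wrote on ccd0 describes 640 sectors
(320KB) less than what the ccd driver computed for the device. Quick
arithmetic, assuming 512-byte sectors (shell, just to show my work):

  echo $(( 468880000 - 468879360 ))          # 640 sectors
  echo $(( (468880000 - 468879360) * 512 ))  # 327680 bytes, i.e. 320KB

I'd assume a label that describes a little less than the ccd provides is
harmless, but that's exactly the sort of thing I'd like confirmed.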
> > Other than changing the fstype to RAID (from ccd), I now have it labeled
> > the same way. raidctl does not raise any warnings, nor do I see warnings
> > during newfs or mount. So it appears that raidctl is "cleaner" with the
> > same disklabel.
> >
> > Because I get no complaints under RAID, and I double-checked (even
> > triple-checked) my arithmetic, I think that my partitions are okay.
> > But maybe they aren't, and the warning is simply phrased very
> > poorly.
> >
> > So, question #1:
> >
> > Should I worry about those warnings with ccd---or their absence in RAID?
>
> I don't know what warnings you were getting from CCD, so it's hard to answer
> that :)
I figured that there weren't too many candidate messages... (^&
> > 2) I would have thought that a RAID 0 and a ccd, with the same
> > stripe size on the same partitions of the same disks, would perform
> > very nearly identically. Yet with ccd and a stripe size of about
> > 1000,
>
> This sounds... "high". How are you managing to feed it 1000 blocks
I think that these suggested numbers came from the vinum docs. I thought
that ccd also suggested those numbers, but in retrospect, I can't find
support for that. I might have been thinking in terms of "cylinders",
though, and using the disklabel cylinder size. As I said elsewhere,
the "usual" 63-block reserved spot at the beginning of the disk was
not enough to make ccdconfig happy, so I went up to the cylinder size as
disklabel reports it. That may have been part of the impetus for
the larger stripes.
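Just to put numbers on those interleave choices (assuming the interleave
is counted in 512-byte sectors, and with both disks in the set):

  echo $(( 32 * 512 ))        # interleave 32:   16384 bytes (16KB) per stripe unit
  echo $(( 1008 * 512 ))      # interleave 1008: 516096 bytes (504KB) per stripe unit
  echo $(( 1024 * 512 ))      # interleave 1024: 524288 bytes (512KB) per stripe unit
  echo $(( 1024 * 512 * 2 ))  # full two-disk stripe at 1024: 1048576 bytes (1MB)

So with the 1024 interleave, a single sequential transfer has to be bigger
than half a megabyte before it even spans both disks (unless it happens to
straddle a chunk boundary).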
> of whatever so that it stripes across all disks for a single IO (for
> hopefully optimal performance)? :)
How the disk buffers sort themselves out, I don't know. I expect that most of
the space on this system will be used by files starting around 100MB in size,
on up. So even though bonnie++ *is* just a benchmark, it may not be a bad
one, with its default of 300MB for filesize. (^& I assume that bonnie++ is
using the usual stdio features of fputc() and fwrite().
Here is one result, with ccd:
ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 34896 85 37222 43 11175 11 27820 91 69783 44 129.5 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1229 96 +++++ +++ 15317 99 1168 93 973 99 3374 99
cyclopes,300M,34896,85,37222,43,11175,11,27820,91,69783,44,129.5,1,16,1229,96,+++++,+++,15317,99,1168,93,973,99,3374,99
Here's a run with a somewhat larger stripe size:
ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 37110 89 39461 35 20663 20 29225 94 78421 48 149.9 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1215 96 +++++ +++ 14980 99 1166 93 979 99 3233 94
cyclopes,300M,37110,89,39461,35,20663,20,29225,94,78421,48,149.9,1,16,1215,96,+++++,+++,14980,99,1166,93,979,99,3233,94
Most of the variation is within what I'd expect between repeated runs of a
single configuration. But the Sequential Input/Block and Sequential
Output/Rewrite numbers both seem to benefit noticeably from the larger interleave.
Finally, for giggles, the 1024-interleave ccd0 with each drive on its own
cable (which is how I ultimately intend to set up the drives, but I wanted
to see how their performance would be affected by sharing a cable):
ccd (ffs, softdep; 2 cables; 1024 interleave)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 37785 85 40555 30 20317 20 28776 94 72832 44 253.6 2
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1209 96 +++++ +++ 11963 85 1210 95 1013 99 3303 99
cyclopes,300M,37785,85,40555,30,20317,20,28776,94,72832,44,253.6,2,16,1209,96,+++++,+++,11963,85,1210,95,1013,99,3303,99
...big spike in seek performance; possibly significant drops in some
other areas (Sequential Input/Block down about 5.6MB/sec, Sequential
Delete down about 3000 files/sec).
> > I was getting (according to bonnie++) a solid 250 (248 to
> > 268 range, I think) seeks per second. With the same stripe size
> > on a RAID 0, I was getting 200 or so (in the best config, upper
> > 190 to 220 range; others more routinely around 160). With RAID 0,
> > there is essentially no overhead for computing parity.
>
> But there is a lot more overhead for other stuff... however: with a
Even RAID 0? I assumed that the extra machinery was skipped past
pretty quickly when there's no parity to compute.
For the most part, RAID 0 and ccd perform very similarly for me.
Other than the issue of seeks, the main differences that I see are:
RAID can autoconfigure, and ccd emits extra warnings that have me
worried. (Notice that these aren't performance differences.
(^&)
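(For reference, the autoconfiguration I have in mind is, if I'm remembering
the flag correctly, just:

  raidctl -A yes raid0

once the set has been configured and initialized.)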
> "stripe size" of 1000 (not sure if that's total sectors per entire
> stripe, or per stripe width, or what :-} ) I'm guessing that this
> RAID set isn't performing anywhere close to optimal. :)
Probably not. But as long as I am close to the limit of what I
can expect an NFS server to deliver over 100Mbps ethernet, I
won't worry too much about whether it might be optimal for the
occasional local access.
> > So, question #2:
> >
> > Why is there such a disparity (ahem) between the two benchmarks?
>
> RAIDframe has way more overhead when one is (effectively) writing to
> just a single disk?
Well, yes. Let me put it another way:
What is the overhead that makes RAID 0 seek significantly more slowly
(by my reading of the numbers) than ccd in these tests?
> > The disks were newfsed the same, using ffs. No use of tunefs was made.
> > (I tried lfs, for giggles, but due to comments about stability when
> > lfs gets around 70% full or so, I will stick to ffs.)
> >
> >
> > If there is interest, I can post the bonnie++ results from 25 to 30
> > runs, including notes about the configuration of the disks. It's not
> > a huge pool of samples, and few runs were repeated on a single
> > configuration. But it may be of interest. Or not. (^&
>
> It is of interest, assuming there are some stripes in there
> that are in the 64K of data per stripe range :)
There should be. I'll see about cleaning up the summaries of the system
configurations a bit.
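In the meantime, here is roughly the raidctl config I plan to try for the
64K-per-stripe case: sectPerSU of 64 is 32KB per component, or 64KB for a
full stripe across the two disks. The partition letters are placeholders
for whatever I end up using, and the details are from memory, so treat it
as a sketch rather than gospel:

  # raid0-64k.conf
  START array
  # numRow numCol numSpare
  1 2 0

  START disks
  /dev/wd1e
  /dev/wd2e

  START layout
  # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
  64 1 1 0

  START queue
  fifo 100

followed by something like:

  raidctl -C raid0-64k.conf raid0   # force the initial configuration
  raidctl -I 2004122001 raid0       # write component labels (serial is arbitrary)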
--
"I probably don't know what I'm talking about." http://www.olib.org/~rkr/