Subject: Re: RAID, ccd, and vinum.
To: Greg Oster <oster@cs.usask.ca>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 12/20/2004 18:32:21
(Sorry in advance if this is a little scattered...)

On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> Richard Rauch writes:
> > I've been playing around with RAID and ccd---and made a pass at vinum,
> > but vinum required an unspecified kernel option, it seems.  (I had used
> > a kernel with the one vinum option in GENERIC enabled, but still got
> > errors with /dev/vinum/control or whatever (an extant node) not being
> > configured.  Unless I somehow rebooted with the wrong kernel, I figured
> > that something in that under-cooked a condition was not a pressing need.
> > (^&)
> > 
> > ccd and RAID both look like they are serviceable.  I did have two questions
> > about them:
> > 
> > 
> > 1) With ccd, I *always* got kernel messages.  At first, I only left 63
> > blocks at the front for the disklabel, as "real" disks use.  Then
> > after re-reading the documentation and googling around, I decided to try
> > using the more or less fictional cylinder size reported by disklabel
> > (1008).  Either I couldn't do arithmetic (possible) or that was still
> > not enough, so I bumped it up to a larger, slightly rounder number.
> > Now, ccdconfig and friends no longer complain in most ways.  Except for
> > a curious other warning: It now seems to be warning me that my disk label
> > is not using the entire disk.  (Well, duh.  It threw a hissy fit when I
> > used the whole disk!  (^&  Besides, as someone in a previous mailing list
> > message observed, one may want to use a portion of a disk in a ccd, and
> > mount other parts as more conventional filesystems.)
> 
> What kernel messages were you seeing from ccd?

The complaint that appears to object to my not using the whole disk is:

WARNING: ccd0: total sector size in disklabel (468879360) != the size of ccd (468880000).
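
Doing the arithmetic on that message (a quick sketch in Python; I'm
assuming the usual 512-byte sectors), the label falls short of the ccd
by 640 sectors, i.e. about 320KB:

  # Back-of-the-envelope check of the ccd0 warning above.
  # Assumes 512-byte sectors; the two counts are copied from the message.
  SECTOR = 512

  ccd_size   = 468880000   # size of the ccd as the kernel computes it (sectors)
  label_size = 468879360   # total sectors in my ccd0 disklabel

  missing = ccd_size - label_size
  print(missing, "sectors =", missing * SECTOR // 1024, "KB not covered")
  # -> 640 sectors = 320 KB at the end that the label doesn't claim

So it really does look like a "you left a sliver unused at the end"
warning rather than anything overlapping.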


> > Other than changing the fstype to RAID (from ccd), I now have it labeled
> > the same way.  raidctl does not raise any warnings, nor do I see warnings
> > during newfs or mount.  So it appears that raidctl is "cleaner" with the
> > same disklabel.
> > 
> > Because I get no complaints under RAID, and thought that I'd double-checked
> > my arithmetic (even triple-checked), I think that my partitions are okay.
> > But maybe they aren't, and the wording for the warning is simply phrased
> > very poorly.
> > 
> > So, question #1:
> > 
> >   Should I worry about those warnings with ccd---or their absence in RAID?
> 
> I don't know what warnings you were getting from CCD, so it's hard to answer 
> that :)

I figured that there weren't too many candidate messages...  (^&


> > 2) I would have thought that a RAID 0 and a ccd, with the same
> > stripe size on the same partitions of the same disks, would perform
> > very nearly identically.  Yet with ccd and a stripe size of about
> > 1000,
> 
> This sounds... "high".  How are you managing to feed it 1000 blocks 

I think that these suggested numbers came from the vinum docs.  I thought
that ccd also suggested those numbers, but in retrospect, I can't find
support for that.  I might have been thinking in terms of "cylinders",
though, and using the disklabel cylinder size.  As I said elsewhere,
the "usual" 63-block reserved spot at the beginning of the disk was
not enough to make ccdconfig happy, so I went up to a full cylinder as
disklabel reports it (1008 sectors).  That may have been part of the
impetus for the larger stripes.
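
For what it's worth, the 1008 figure looks like it's just disklabel's
fictitious geometry; my guess (and it is only a guess) is 63 sectors per
track times 16 tracks per cylinder:

  # Where the "cylinder" size of 1008 probably comes from.  The 63/16
  # split is my assumption about disklabel's fake geometry, not verified.
  sectors_per_track = 63
  tracks_per_cyl    = 16
  print(sectors_per_track * tracks_per_cyl)   # -> 1008 sectors per "cylinder"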


> of whatever so that it stripes across all disks for a single IO (for 
> hopefully optimal performance)? :)

How the disk buffers sort themselves out, I don't know.  I expect that most of
the space on this system will be used by files starting around 100MB in size,
on up.  So even though bonnie++ *is* just a benchmark, it may not be a bad
one, with its default of 300MB for filesize.  (^&  I assume that bonnie++ is
using the usual stdio features of fputc() and fwrite().
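
One thing I can at least put numbers on: if a single transfer tops out
around 64KB (I believe that's the usual MAXPHYS, but treat that as an
assumption on my part), then with two disks the interleave decides
whether one I/O ever touches both spindles:

  # Rough look at how the interleave relates to a single I/O.  Assumes
  # 512-byte sectors and a 64KB maximum transfer -- both are my
  # assumptions, not something verified against the kernel here.
  SECTOR = 512
  MAX_IO = 64 * 1024            # bytes in one transfer (assumed)
  ndisks = 2

  for interleave in (32, 1024): # sectors per component per stripe unit
      stripe_unit = interleave * SECTOR
      spanned = min(ndisks, max(1, MAX_IO // stripe_unit))
      print("interleave", interleave, "-> one 64KB transfer spans",
            spanned, "disk(s)")
  # interleave 32   -> 16KB per component, so a 64KB I/O hits both disks
  # interleave 1024 -> 512KB per component, so a 64KB I/O stays on one disk

So with the 1024-sector interleave, a single transfer is effectively
talking to one drive at a time.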


Here is one result, with ccd:

ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 34896  85 37222  43 11175  11 27820  91 69783  44 129.5   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1229  96 +++++ +++ 15317  99  1168  93   973  99  3374  99
cyclopes,300M,34896,85,37222,43,11175,11,27820,91,69783,44,129.5,1,16,1229,96,+++++,+++,15317,99,1168,93,973,99,3374,99


Here's a run with a somewhat larger stripe size:

ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 37110  89 39461  35 20663  20 29225  94 78421  48 149.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1215  96 +++++ +++ 14980  99  1166  93   979  99  3233  94
cyclopes,300M,37110,89,39461,35,20663,20,29225,94,78421,48,149.9,1,16,1215,96,+++++,+++,14980,99,1166,93,979,99,3233,94


Most of the variation is within what you'd expect between single runs.  But
the Sequential Output/Rewrite figure (11175 -> 20663 K/sec) and, to a lesser
degree, Sequential Input/Block (69783 -> 78421 K/sec) both seem to benefit
noticeably from the larger interleave.
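
Putting numbers on that (taken straight from the two runs above):

  # Percentage change going from interleave 32 to interleave 1024.
  # The figures are copied from the bonnie++ output above.
  runs = {
      "Seq Output/Rewrite (K/sec)": (11175, 20663),
      "Seq Input/Block (K/sec)":    (69783, 78421),
      "Random Seeks (/sec)":        (129.5, 149.9),
  }
  for name, (before, after) in runs.items():
      print(f"{name}: {100.0 * (after - before) / before:+.1f}%")
  # Rewrite up roughly 85%, block input about 12%, seeks about 16%.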


Finally, for giggles, the 1024-sized-stripe ccd0, with each drive on its own
cable (which is how I intended to ultimately set the drives, but I wanted
to see how their performance would be affected by sharing a cable):

ccd (ffs, softdep; 2 cables; 1024  interleave)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 37785  85 40555  30 20317  20 28776  94 72832  44 253.6   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1209  96 +++++ +++ 11963  85  1210  95  1013  99  3303  99
cyclopes,300M,37785,85,40555,30,20317,20,28776,94,72832,44,253.6,2,16,1209,96,+++++,+++,11963,85,1210,95,1013,99,3303,99

...big spike in seek performance (149.9 -> 253.6 seeks/sec); possibly
significant drop in some other areas (Sequential Input/Block down about
5.6 MB/sec, Sequential Delete down about 3000 files/sec).


> > I was getting (according to bonnie++) a solid 250 (248 to
> > 268 range, I think) seeks per second.  With the same stripe size
> > on a RAID 0, I was getting 200 or so (in the best config, upper
> > 190 to 220 range; others more routinely around 160).  With RAID 0,
> > there is essentially no overhead for computing parity.
> 
> But there is a lot more overhead for other stuff... however: with a 

Even RAID 0?  I assumed that the extra stuff was skipped around
pretty quickly when there's no parity to compute.

For the most part, RAID 0 and ccd perform very similarly for me.
Other than the seek numbers, the main differences that I see are:
RAIDframe can autoconfigure, and ccd emits extra warnings that have me
worried.  (Notice that neither of those is a performance difference.
(^&)


> "stripe size" of 1000 (not sure if that's total sectors per entire 
> stripe, or per stripe width, or what :-} ) I'm guessing that this 
> RAID set isn't performing anywhere close to optimal. :)

Probably not.  But as long as I am close to the limit of what I
can expect an NFS server to deliver over 100 Mb/s Ethernet, I
won't worry too much about whether it might be optimal for the
occasional local access.
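
(To put numbers on that: 100 Mb/s is only about 12 MB/sec of payload even
before protocol overhead, which is well under what either array manages
locally.)

  # Why I'm not fussing over local optimality: the wire is the bottleneck.
  # Ignores NFS/TCP overhead, so the real ceiling is lower still.
  wire_rate = 100_000_000 / 8 / 1024       # ~12207 KB/sec on the wire
  local_block_read = 78421                 # K/sec, from the bonnie++ run above
  print(f"wire ceiling ~{wire_rate:.0f} KB/sec vs ~{local_block_read} KB/sec local")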


> > So, question #2:
> > 
> >   Why is there such a disparity (ahem) between the two benchmarks?
> 
> RAIDframe has way more overhead when one is (effectively) writing to 
> just a single disk?

Well, yes.  Let me put it another way:

What is some of the overhead that makes RAID 0 seek noticeably slower
(by my estimation) in these tests?


> > The disks were newfsed the same, using ffs.  No use of tunefs was made.
> > (I tried lfs, for giggles, but due to comments about stability when
> > lfs gets around 70% full or so, I will stick to ffs.)
> > 
> > 
> > If there is interest, I can post the bonnie++ results from 25 to 30
> > runs, including notes about the configuration of the disks.  It's not
> > a huge pool of samples, and few runs were repeated on a single
> > configuration.  But it may be of interest.  Or not.  (^&
> 
> It is of interest, assuming there are some stripes in there 
> that are in the 64K of data per stripe range :)

There should be.  I'll see about cleaning up the summaries of the system
configurations a bit.
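
To make sure I cover the range you mean: I read "64K of data per stripe"
as 64KB across the full stripe (32KB per component on two disks), but in
case you mean 64KB per component, here is the conversion both ways
(assuming 512-byte sectors):

  # Converting a target stripe size in bytes to an interleave in sectors,
  # assuming 512-byte sectors and 2 data disks.  Whether "64K per stripe"
  # means per full stripe or per component is my guess, so both are shown.
  SECTOR = 512
  ndisks = 2

  for label, total_bytes in (("64KB per full stripe", 64 * 1024),
                             ("64KB per component",   64 * 1024 * ndisks)):
      per_component = total_bytes // ndisks
      print(label, "-> interleave =", per_component // SECTOR, "sectors/component")
  # -> 64 sectors/component (32KB each) or 128 sectors/component (64KB each)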


-- 
  "I probably don't know what I'm talking about."  http://www.olib.org/~rkr/