Subject: Re: RAID, ccd, and vinum.
To: Greg Oster <oster@cs.usask.ca>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 12/21/2004 08:44:11
On Mon, Dec 20, 2004 at 09:02:27PM -0600, Greg Oster wrote:
> Richard Rauch writes:
 [...]
> > On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> > > Richard Rauch writes:
 [...]
> > > I don't know what warnings you were getting from CCD, so it's hard to answer 
> > > that :)
> > 
> > I figured that there weren't too many candidate messages...  (^&
> 
> But forgot that some of us are too lazy to look them up ;)

(^&


> > > > 2) I would have thought that a RAID 0 and a ccd, with the same
> > > > stripe size on the same partitions of the same disks, would perform
> > > > very nearly identically.  Yet with ccd and a stripe size of about
> > > > 1000,
> > > 
> > > This sounds... "high".  How are you managing to feed it 1000 blocks 
> > 
> > I think that these suggested numbers came from the vinum docs.  I thought
> > that ccd also suggested those numbers, but in retrospect, I can't find
> > support for that.  I might have been thinking in terms of "cylinders",
> > though, and using the disklabel cylinder size.  As I said elsewhere,
> > the "usual" 63-block reserved spot at the beginning of the disk was
> > not enough to make ccdconfig happy, so I went up to a cylinder-size as
> > disklabel reports it.  That may have been part of the impetus for
> > the larger stripes.
> > 
> > 
> > > of whatever so that it stripes across all disks for a single IO (for 
> > > hopefully optimal performance)? :)
> > 
> > How the disk buffers sort themselves out, I don't know.  I expect that most of
> > the space on this system will be used by files starting around 100MB in size,
> > on up.  So even though bonnie++ *is* just a benchmark, it may not be a bad
> > one, with its default of 300MB for filesize.  (^&  I assume that bonnie++ is
> > using the usual stdio features of fputc() and fwrite().
> 
> 300MB is typically a bit small these days...  sizeof(RAM) is usually a 
> bit better.

Well, the NFS server has less than that, so 300MB forces the server
to move bits on and off disk.

The client (where I do most editing) has 512MB.  Even that isn't
always enough, though.  When it isn't, it needs to talk to the server.
And that happens at ethernet (100Mbit/sec) speeds, which top out around
12MB/sec of payload, so if the disk can sustain 10MB or so per second on
large files (bonnie++ indicates rather more than that), I'm going to wind
up waiting on the network anyway.

Short of upgrading to gigabit, or trying to juggle 2+ NICs for one
NFS mount (can that be done?) on both ends, that's where it mostly
ends.
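
(When I go back for another round of measurements, I'll probably follow
the suggestion and push the file size past the server's RAM.  Roughly
the invocation I have in mind, with a made-up mount point and my own
user name as placeholders:

    bonnie++ -s 600 -d /mnt/raid/tmp -u rkr

where -s is the file size in MB, -d the directory to test in, and -u
the user to run as when started as root.)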


> It's just that with restrictions like MAXPHYS, RAIDframe (and CCD) never 
> gets presented with more than 64K of data at a time...  So for very 
> large stripe sizes, you're probably only touching one disk for a 
> given IO.

(ponder)

On reflection, I see that not only do I not understand what the kernel
does with disk I/O (I don't play with the kernel sources; (^&), but I
also had a fuzzy mental picture of it.
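
Just to get the arithmetic straight for myself (a back-of-the-envelope
sketch, assuming 512-byte sectors and the 64KB MAXPHYS you mention):

    # How many of the two components can a single MAXPHYS-sized
    # transfer touch, for a few of the interleaves I tried?
    maxphys_sectors=128                 # 64KB / 512 bytes
    for ileave in 32 64 1024; do
        span=$(( (maxphys_sectors + ileave - 1) / ileave ))
        [ "$span" -gt 2 ] && span=2     # only two disks in the stripe
        echo "interleave $ileave: one 64KB IO spans $span disk(s)"
    done

So with an interleave of 1024 a single transfer never leaves one
component, which would fit with the big-stripe runs looking like a
single drive.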


> > Here is one result, with ccd:
> > 
> > ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
> [snip] 
> > Here's a run with a somewhat larger stripe size:
> > 
> > ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
> [snip]
> 
> 1 cable means two (both?) drives on the same cable?

Yes.  It was an annotation that I tacked on while I was doing the tests.
It's a little terse.  In that case, I was curious how one cable
would differ from two if the disks were being used in tandem.  I had
run a number of single-cable tests with ccd.  I don't think that I
bothered comparing RAID with just 1 cable.


 [...]
> For further giggles, try the benchmarks using just a single disk... 

Did that one, too.  Here's a sample:

wd1a, softdep mount
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 36367  83 37560  27 12245  11 28770  94 50729  30 158.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1216  97 +++++ +++ 14136  99  1183  94   996  99  3170  95
cyclopes,300M,36367,83,37560,27,12245,11,28770,94,50729,30,158.3,1,16,1216,97,+++++,+++,14136,99,1183,94,996,99,3170,95


Some things are significantly lower.  Mostly it's about the same.

(I also have a comparison-only bonnie++ run on the 4GB IBM
drive that that machine uses for /.  (^&)



 [...]
> > Well, yes.  Let me put it another way:
> > 
> > What is some of the overhead that makes RAID 0 perform significantly
> > slower (to my estimation) at seeking in these tests?
> 
> Even for RAID 0, RAIDframe constructs a directed, acyclic graph to 
> describe the IO operation.  It then traverses this graph from single 
> source to single sink, "firing" nodes along the way.  And while this 
> provides a very general way of describing disk IO, all that graph 
> creation, traversing, and teardown does take some time.

Interesting...  I'm surprised that when seeks are running on the order
of 100 to 250 per second, there is that much work for an 800MHz Athlon.
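
(Putting rough numbers on that surprise, just my own arithmetic:

    echo $(( 800000000 / 250 ))   # ~3.2 million cycles per IO at 250 seeks/s
    echo $(( 800000000 / 135 ))   # ~5.9 million cycles per IO at 135 seeks/s

so the graph creation and teardown would have to cost millions of cycles
per IO before the CPU, rather than the disks, became the bottleneck.)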



> For 2 disks in a RAID 0 config, try a stripe width of 64.  If the 
> filesystem is going to have large files on it, a block/frag setting 
> of 65536/8192 might yield quite good performance.

Thanks.  I'll give it a spin.  (^&
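
Something along these lines, I take it (the device names and serial
number are mine, and I'm going from memory of the raidctl(8) examples,
so treat this as a sketch rather than a recipe).  The config file,
say raid0.conf:

    START array
    # numRow numCol numSpare
    1 2 0

    START disks
    /dev/wd1e
    /dev/wd2e

    START layout
    # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
    64 1 1 0

    START queue
    fifo 100

and then something like:

    raidctl -C raid0.conf raid0     # initial (forced) configuration
    raidctl -I 2004122101 raid0     # stamp the component labels
    disklabel -r -e -I raid0        # add a raid0e partition
    newfs -b 65536 -f 8192 /dev/rraid0e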

I had actually tried a stripe size of 64.  In fact, reviewing the
results, it seems that one of those runs came fairly close to ccd for
seeks (a little over 220 seeks/sec) and was otherwise about as good as I
was going to get:

raid 0 (ffs; softdeps; 2 cables; 64 stripe)      
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 35343  89 39401  36 15060  16 27032  93 78939  54 223.5   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1176  97 +++++ +++ 14202  99  1171  95   959  99  3120  93
cyclopes,300M,35343,89,39401,36,15060,16,27032,93,78939,54,223.5,2,16,1176,97,+++++,+++,14202,99,1171,95,959,99,3120,93


I didn't fool with the disklabel much.  Maybe I should have.  I did try
telling newfs to use different block-sizes.  Here's a modification of the
above, with newfs blocksize of 64K:

raid 0 (same as above, but newfs block-size of 64K)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
cyclopes       300M 36782  89 38826  35 17878  20 27493  94 84635  56 136.1   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1215  98 +++++ +++ 12901  99  1197  98   959  99  3274  94
cyclopes,300M,36782,89,38826,35,17878,20,27493,94,84635,56,136.1,2,16,1215,98,+++++,+++,12901,99,1197,98,959,99,3274,94


Then again, here's the current config over NFS (the end result that
really affects me):

raid 0 (raid stripe of 64; softdep; normal newfs; NFS-mounted)
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
socrates       600M 11206   8 11199   1  3513   1 11355  16 11370   1 116.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   480   0  5581   6  3550   4   484   0  5880   5  1484   2
socrates,600M,11206,8,11199,1,3513,1,11355,16,11370,1,116.0,0,16,480,0,5581,6,3550,4,484,0,5880,5,1484,2

-- 
  "I probably don't know what I'm talking about."  http://www.olib.org/~rkr/