Subject: Re: RAID, ccd, and vinum.
To: Greg Oster <oster@cs.usask.ca>
From: Richard Rauch <rkr@olib.org>
List: netbsd-help
Date: 12/21/2004 08:44:11
On Mon, Dec 20, 2004 at 09:02:27PM -0600, Greg Oster wrote:
> Richard Rauch writes:
[...]
> > On Mon, Dec 20, 2004 at 09:59:14AM -0600, Greg Oster wrote:
> > > Richard Rauch writes:
[...]
> > > I don't know what warnings you were getting from CCD, so it's hard to answer
> > > that :)
> >
> > I figured that there weren't too many candidate messages... (^&
>
> But forgot that some of us are too lazy to look them up ;)
(^&
> > > > 2) I would have thought that a RAID 0 and a ccd, with the same
> > > > stripe size on the same partitions of the same disks, would perform
> > > > very nearly identically. Yet with ccd and a stripe size of about
> > > > 1000,
> > >
> > > This sounds... "high". How are you managing to feed it 1000 blocks
> >
> > I think that these suggested numbers came from the vinum docs. I thought
> > that ccd also suggested those numbers, but in retrospect, I can't find
> > support for that. I might have been thinking in terms of "cylinders",
> > though, and using the disklabel cylinder size. As I said elsewhere,
> > the "usual" 63-block reserved spot at the beginning of the disk was
> > not enough to make ccdconfig happy, so I went up to a cylinder-sized
> > offset, as disklabel reports it. That may have been part of the
> > impetus for the larger stripes.
> >
> >
> > > of whatever so that it stripes across all disks for a single IO (for
> > > hopefully optimal performance)? :)
> >
> > How the disk buffers sort themselves out, I don't know. I expect that most of
> > the space on this system will be used by files starting around 100MB in size,
> > on up. So even though bonnie++ *is* just a benchmark, it may not be a bad
> > one, with its default of 300MB for filesize. (^& I assume that bonnie++ is
> > using the usual stdio features of fputc() and fwrite().
>
> 300MB is typically a bit small these days... sizeof(RAM) is usually a
> bit better.
Well, the NFS server has less than that, so 300MB forces the server
to move bits on and off disk.
The client (where I do most editing) has 512MB. Even that isn't
always enough, though. When it isn't, it needs to talk to the server.
And that happens at ethernet (100Mbit/sec) speeds, so...if the disk can
do sustained rates of 10MB or so per second on large files (bonnie++
indicates rather more than that), I'm going to wind up waiting on the
network anyway.
Short of upgrading to gigabit, or trying to juggle 2+ NICs for one
NFS mount (can that even be done?) on both ends, the network is where
the bottleneck ends up.
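(For what it's worth, the bonnie++ sizing I have in mind looks roughly
like this -- the test directory and user are just placeholders:

   # keep the test file bigger than the client's 512MB of RAM
   # -d test directory, -s file size in MB, -u user to run as
   bonnie++ -d /mnt/raid/tmp -s 600 -u rkr

600MB being what I used for the over-NFS runs further down.)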
> It's just that with restrictions like MAXPHYS, RAIDframe (and CCD) never
> gets presented with more than 64K of data at a time... So for very
> large stripe sizes, you're probably only touching one disk for a
> given IO.
(ponder)
On reflection, I see that not only do I not understand what the kernel
does with disk I/O (I don't play with the kernel sources (^&), but I
also had a bad "fuzzy mental picture".
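If I run the numbers (assuming 512-byte sectors, since the interleave
is given in sectors):

   MAXPHYS         = 64KB = 128 sectors per transfer
   interleave 1024 = 512KB per stripe unit, so a single 64KB transfer
                     lands entirely on one disk
   interleave 32   = 16KB per stripe unit, so a 64KB transfer spans
                     four stripe units across both disks

So with the big interleave, each transfer is effectively hitting only
one drive at a time, as you say.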
> > Here is one result, with ccd:
> >
> > ccd0 (both wd1, wd2; softdep; interleave 32; 1 cable)
> [snip]
> > Here's a run with a somewhat larger stripe size:
> >
> > ccd0 (both wd1, wd2; softdep; interleave 1024; 1 cable)
> [snip]
>
> 1 cable means two (both?) drives on the same cable?
Yes. It was an annotation that I tacked on while I was doing the tests,
so it's a little terse. In that case, I was curious how one cable
would differ from two when the disks were being used in tandem. I had
run a number of single-cable tests with ccd, but I don't think that I
bothered comparing RAID with just one cable.
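For reference, the ccd setups in those runs were along these lines (the
partition letter is from memory, so take it with a grain of salt):

   # interleave is in sectors; "none" means no flags
   ccdconfig ccd0 32 none /dev/wd1e /dev/wd2e     # the interleave-32 runs
   ccdconfig -u ccd0                              # unconfigure between runs
   ccdconfig ccd0 1024 none /dev/wd1e /dev/wd2e   # the interleave-1024 runs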
[...]
> For further giggles, try the benchmarks using just a single disk...
Did that one, too. Here's a sample:
wd1a, softdep mount
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 36367 83 37560 27 12245 11 28770 94 50729 30 158.3 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1216 97 +++++ +++ 14136 99 1183 94 996 99 3170 95
cyclopes,300M,36367,83,37560,27,12245,11,28770,94,50729,30,158.3,1,16,1216,97,+++++,+++,14136,99,1183,94,996,99,3170,95
Some things are significantly lower. Mostly it's about the same.
(I also have a comparison-only bonnie++ run on the 4GB IBM
drive that that machine uses for /. (^&)
[...]
> > Well, yes. Let me put it another way:
> >
> > What is some of the overhead that makes RAID 0 perform significantly
> > slower (to my estimation) at seeking in these tests?
>
> Even for RAID 0, RAIDframe constructs a directed, acyclic graph to
> describe the IO operation. It then traverses this graph from single
> source to single sink, "firing" nodes along the way. And while this
> provides a very general way of describing disk IO, all that graph
> creation, traversing, and teardown does take some time.
Interesting... I'm surprised that when seeks are running on the order
of 100 to 250 per second, there is that much work for an 800MHz Athlon.
> For 2 disks in a RAID 0 config, try a stripe width of 64. If the
> filesystem is going to have large files on it, a block/frag setting
> of 65536/8192 might yield quite good performance.
Thanks. I'll give it a spin. (^&
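If I'm reading raidctl(8) right, what I'll try will look more or less
like this (the wd1e/wd2e partitions and the serial number are just my
placeholders):

   # /etc/raid0.conf -- two disks, RAID 0, stripe width (sectPerSU) 64
   START array
   # numRow numCol numSpare
   1 2 0

   START disks
   /dev/wd1e
   /dev/wd2e

   START layout
   # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   64 1 1 0

   START queue
   fifo 100

and then something like:

   raidctl -C /etc/raid0.conf raid0   # force the initial configuration
   raidctl -I 2004122101 raid0        # stamp component labels (arbitrary serial)
   # (disklabel raid0 to carve out an 'a' partition, then:)
   newfs -b 65536 -f 8192 /dev/rraid0a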
I had actually tried a 64 stripe size before. In fact, reviewing my
notes, it seems that one of those runs performed fairly close to ccd
for seeks (a little over 220 seeks/sec) and otherwise was about as good
as I was going to get:
raid 0 (ffs; softdeps; 2 cables; 64 stripe)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 35343 89 39401 36 15060 16 27032 93 78939 54 223.5 2
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1176 97 +++++ +++ 14202 99 1171 95 959 99 3120 93
cyclopes,300M,35343,89,39401,36,15060,16,27032,93,78939,54,223.5,2,16,1176,97,+++++,+++,14202,99,1171,95,959,99,3120,93
I didn't fool with the disklabel much. Maybe I should have. I did try
telling newfs to use different block-sizes. Here's a modification of the
above, with newfs blocksize of 64K:
raid 0 (same as above, but newfs block-size of 64K)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cyclopes 300M 36782 89 38826 35 17878 20 27493 94 84635 56 136.1 2
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1215 98 +++++ +++ 12901 99 1197 98 959 99 3274 94
cyclopes,300M,36782,89,38826,35,17878,20,27493,94,84635,56,136.1,2,16,1215,98,+++++,+++,12901,99,1197,98,959,99,3274,94
Then, for comparison, here's the current config over NFS (the end
result that really affects me):
raid 0 (raid stripe of 64; softdep; normal newfs; NFS-mounted)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
socrates 600M 11206 8 11199 1 3513 1 11355 16 11370 1 116.0 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 480 0 5581 6 3550 4 484 0 5880 5 1484 2
socrates,600M,11206,8,11199,1,3513,1,11355,16,11370,1,116.0,0,16,480,0,5581,6,3550,4,484,0,5880,5,1484,2
--
"I probably don't know what I'm talking about." http://www.olib.org/~rkr/