Subject: Re: some problems with "old" RAIDframe arrays on netbsd-1-6
To: Greg Oster <oster@cs.usask.ca>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/19/2003 19:13:50
[ On Sunday, October 19, 2003 at 15:13:27 (-0600), Greg Oster wrote: ]
> Subject: Re: some problems with "old" RAIDframe arrays on netbsd-1-6
>
> Can you send the output of "raidctl -s raid0" and of "disklabel foo0"
> where "foo0" contains one of the active components? RAIDframe is
> usually only crabby about these sorts of things if there is an actual
> size difference that will cause a problem.
Note that if I'm reading the kernel message right it was apparently
seeing the spare partition as only 512 sectors:
Spare disk /dev/sd6a (512 blocks) is too small to serve as a spare (need 8890688 blocks)
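(The check that produced that message is simple enough; here's a minimal sketch of the logic, not the actual kernel code -- a spare is rejected if it can't hold one component's worth of data:

```python
# Hypothetical sketch of RAIDframe's spare admission check (not the
# actual kernel source): a spare must be at least as large as the
# per-component data area it might have to replace.

def spare_is_big_enough(spare_blocks: int, component_blocks: int) -> bool:
    """Return True if the spare can stand in for a failed component."""
    return spare_blocks >= component_blocks

# The failure above: the kernel somehow saw the spare as only 512
# blocks, nowhere near the 8890688 blocks each component provides.
print(spare_is_big_enough(512, 8890688))      # -> False
print(spare_is_big_enough(8896512, 8890688))  # -> True
```

so the real question is why the kernel saw only 512 blocks for sd6a in the first place.)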
The underlying problem I'm having, which requires a new component on this
array, is that one of the disks, or one of the SCSI isolators in the
hot-swap bays, is going bad. I either get a bus parity error, or
something like:
sd12(ahc1:0:8:0): Unexpected busfree in Data-in phase
SEQADDR == 0x113
sd12(ahc1:0:8:0): generic HBA error
Of course after a reboot all the disks on that shelf look fine until the
drive gets used a bit. Indeed it will usually work long enough to do a
full reconstruct.
This time when I rebooted sd12d originally looked "optimal", but still
had "autoconfig" disabled in its component label so I had to manually
configure raid0 to get it working.
So, while still in single-user mode, I failed /dev/sd12d and
reconstructed right back to it (-R), then ran a forced fsck just to
make sure everything was OK, and it was.
Note too that when I did the manual re-configure, the new "sd6d"
appeared as a spare, as expected, since I now had it listed in the
raid0.conf file; but perhaps because it was never formally added, and
the array was already initialized, it disappeared again on the next
reboot.
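(For reference, the raid0.conf in question would be of roughly this
shape -- reconstructed here from the component labels shown below, so
treat it as an approximation; in particular the "fifo 100" queue line
is an assumption matching the "Queue size: 100" in the labels:

```
START array
# numRow numCol numSpare
1 6 1

START disks
/dev/sd7d
/dev/sd8d
/dev/sd9d
/dev/sd10d
/dev/sd11d
/dev/sd12d

START spare
/dev/sd6d

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5

START queue
fifo 100
```

i.e. a 1x6 RAID-5 set with one spare listed.)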
Anyway, here's how it looked after I reconstructed sd12d in single-user
mode:
# raidctl -v -s raid0
Components:
/dev/sd7d: optimal
/dev/sd8d: optimal
/dev/sd9d: optimal
/dev/sd10d: optimal
/dev/sd11d: optimal
/dev/sd12d: optimal
Spares:
/dev/sd6d: spare
Component label for /dev/sd7d:
Row: 0, Column: 0, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
Component label for /dev/sd8d:
Row: 0, Column: 1, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
Component label for /dev/sd9d:
Row: 0, Column: 2, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
Component label for /dev/sd10d:
Row: 0, Column: 3, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
Component label for /dev/sd11d:
Row: 0, Column: 4, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
Component label for /dev/sd12d:
Row: 0, Column: 5, Num Rows: 1, Num Columns: 6
Version: 2, Serial Number: 2, Mod Counter: 533
Clean: No, Status: 0
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 8890688
RAID Level: 5
Autoconfig: Yes
Root partition: No
Last configured as: raid0
/dev/sd6d status is: spare. Skipping label.
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
It didn't take long before sd12 failed again, with the SCSI error shown
above appearing on the console, followed of course by:
raid0: IO Error. Marking /dev/sd12d as failed.
raid0: node (Rod) returned fail, rolling backward
sd12(ahc1:0:8:0): generic HBA error
raid0: Disk /dev/sd12d is already marked as dead!
raid0: node (Rod) returned fail, rolling backward
raid0: DAG failure: w addr 0x13937df (20527071) nblk 0x20 (32) buf 0xc5a4e000
raid0: DAG failure: w addr 0x13874df (20477151) nblk 0x20 (32) buf 0xc5a42000
Now, as I mentioned, after the second reboot sd6d disappeared as a
spare again:
# raidctl -v -s raid0
Components:
/dev/sd7d: optimal
/dev/sd8d: optimal
/dev/sd9d: optimal
/dev/sd10d: optimal
/dev/sd11d: optimal
/dev/sd12d: failed
No spares.
Component label for /dev/sd7d:
[[ ... snip ... ]]
/dev/sd12d status is: failed. Skipping label.
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
Here's the disklabel from the first of the original components:
# disklabel sd7
# /dev/rsd7d:
type: SCSI
disk: QUANTUM_X34550WD
label:
flags:
bytes/sector: 512
sectors/track: 150
tracks/cylinder: 10
sectors/cylinder: 1500
cylinders: 5899
total sectors: 8890760
rpm: 7200
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
4 partitions:
# size offset fstype [fsize bsize cpg/sgs]
d: 8890760 0 RAID # (Cyl. 0 - 5927*)
(Note that before I upgraded the fstype was "unused", though as I said
it still auto-configured; after I upgraded I had to change it to "RAID",
of course, to get autoconfig to work on these "old" arrays.)
Here's the disklabel from sd6, which I'm about to try re-adding as a
spare again:
# disklabel sd6
# /dev/rsd6d:
type: SCSI
disk: VIKING 4.5 WSE
label: raid0-spare
flags:
bytes/sector: 512
sectors/track: 181
tracks/cylinder: 8
sectors/cylinder: 1448
cylinders: 6144
total sectors: 8896512
rpm: 7200
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
4 partitions:
# size offset fstype [fsize bsize cpg/sgs]
d: 8896512 0 RAID # (Cyl. 0 - 6143)
(Note this time I've just left it at its full-disk size...)
This time, for some unexplained reason, it works:
# raidctl -v -a /dev/sd6d raid0
#
and from the console:
Warning: truncating spare disk /dev/sd6d to 8890688 blocks
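(The numbers are self-consistent, by the way. A sketch of the
arithmetic -- my assumption, based only on the figures in this mail,
being that 64 sectors are reserved at the front of each partition for
the component label, and the remainder is rounded down to a whole
number of 32-sector stripe units:

```python
# Assumed sketch of how RAIDframe arrives at a partition's usable size:
# reserve some sectors for the component label, then round down to a
# whole number of stripe units.

RESERVED_SECTORS = 64  # assumed label reservation per component
SECT_PER_SU = 32       # stripe unit size, from the component labels

def usable_blocks(total_sectors: int) -> int:
    """Usable data blocks in a partition of the given raw size."""
    usable = total_sectors - RESERVED_SECTORS
    return usable - (usable % SECT_PER_SU)

# The 8890760-sector components (sd7d..sd12d) yield 8890688 blocks,
# exactly the numBlocks in the labels above:
print(usable_blocks(8890760))  # -> 8890688

# The larger spare is simply truncated to match the components:
print(min(usable_blocks(8896512), 8890688))  # -> 8890688
```

which matches the "truncating spare disk ... to 8890688 blocks" warning
exactly.)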
So, whatever the problem was, it's not easily reproducible.
It's now beginning the reconstruction.....
(And perhaps because sd6 is on a separate bus from sd7-sd12, it is now
claiming only 21 minutes -- about half the time it took to reconstruct
to sd12 before!)
--
Greg A. Woods
+1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>