Subject: how to mess up your week working around failing disk
To: None <current-users@netbsd.org>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 10/13/1999 06:22:34
fun with NetBSD system administration...
1. have an IBM DGHS09U develop bad blocks in the middle of the disk, but
fortunately in the middle of an unimportant partition. (one more use
for many partitions; you can deploy empty/unused ones to contain clusters
of errors). have this situation last a long time.
2. observe scsictl reassign working, start reassigning blocks in the
bad partition, which are discovered when reading back large (cda)
files written into it.
3. grovel through ibm tech docs to see what it's complaining about when
the reassign doesn't always work, or causes the drive to think for a long
time. discover you probably have to replace your disk soonishly.
4. just as you are moving things around to take backups (while doing
all sorts of work with pkgsrc/audio/lame), crash. decide now is a
good time to rearrange disks so that the two DGHS09Us aren't so close
together, so as to avoid heating up.
be annoyed when this takes much longer than expected thanks to poor
design of PC gear, particularly in how things are mounted (grrrr)
5. boot. bad DGHS09U makes upset noises and refuses to talk to you again,
other than to complain about internal errors. grr.
6. break out ancient (1.3) IDE drive and boot that (no floppy)
upgrade that to -current (thanks laptop!)
rebuild system on 3 remaining 9gb scsi drives
store away ide drive again
note that in the presence of 3 scsi disks, and biosboot.sym
from -current, with SCSI,A,C boot sequence and no floppy,
"dev" shows only sd0:
wd0: does nothing
sd3: lets you at the contents of the ide drive (it's not atapi)
i can't remember what happens when i set C,A,SCSI
7. have fun figuring out how to boot everything. -current bootblocks
seem unhappy with 4 scsi drives and one ide drive to choose from,
but i can't really characterize that, since i got the ide drive
out of the system fast. with all 5 disks, "ls" hangs, trying to boot
locks things up to the point where the reset button needs pushing.
boot from sd2. sd2a is root partition. mental note to self was: be sure
to move sd2 to sd0 and sd0 furthest back in the scsi numbering.
8. rebuild everything...
9. observe that snapshot from XFree86 now does 1600x1200x24 on my screen
which isn't supposed to be able to do that. whee. too bad my motherboard
is ancient, slow, and has too little memory... :(
10. play play play play (insomnia)
11. system wedges up. crunch. hit reset button. get distracted, expecting
bootblocks to hang, not being able to talk to sd0.
12. (&(*&^*&^#@$5 sd0 woke up. how i discover this is observing
fsck vomit over lots of things (IDUPs in particular), in sd0 (!!)
and in a partition on sd2.
13. fsck everything. even sd0 fscks clean after running through it a few times.
14. while in single user mode, start doing dumps of sd0 into files
in big partitions on other disks. yay, i can retrieve what's there.
sd0a - no problem
sd0b - gets well into Pass IV and then
panic: getblk: block size invariant failed
Stopped in dump at Debugger+0x4: leave
db> t/t
[typed in manually]
Debugger(...)
panic(f0215260,fe56f038,fe537b40,ffffffff,0) at panic + 0x55
getblk(fe56f038,71f100,4000,0,0) at getblk+0xec
bread(fe56f038,71f100,4000,ffffffff,fe580e54) at bread+0x2d
spec_read(fe580e94,fe580ea8,f016d55e,fe580e94,fe580f88) at spec_read
ufsspec_read(damn rsi)
vn_read(ditto)
dofileread(ditto)
sys_read(ditto)
syscall()
--- syscall (number 3) ---
0x807dbaf:
maybe this will dump core...
my guess is that this is sd0 failing again. anybody wanna have
me do anything before i type more at db> ? i need an rsi break, so
if you're fast you can help improve NetBSD's resilience in pessimally
ugly i/o situations. -:)
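for anyone reading the trace above, here is a rough sketch of how the getblk
arguments decode. it assumes the usual getblk(vp, blkno, size, ...) argument
order, and that blkno on the device vnode is counted in 512-byte (DEV_BSIZE)
units; the trace itself confirms neither, and the helper name decode_getblk is
made up for illustration.

```python
# Decode the second and third getblk() arguments from the hand-typed trace.
# Assumptions (not confirmed by the trace): arg 2 is a block number in
# DEV_BSIZE units, arg 3 is a request size in bytes, DEV_BSIZE is 512.
DEV_BSIZE = 512


def decode_getblk(blkno_hex, size_hex, dev_bsize=DEV_BSIZE):
    """Turn the hex strings printed by ddb into decimal values."""
    blkno = int(blkno_hex, 16)
    size = int(size_hex, 16)
    return {
        "blkno": blkno,                      # block number on the device
        "size_bytes": size,                  # size of the requested buffer
        "byte_offset": blkno * dev_bsize,    # where on the partition the read lands
    }


info = decode_getblk("71f100", "4000")
for key, value in info.items():
    print(f"{key}: {value}")
```

if those assumptions hold, the panic hit on a 16k read a bit over 3.5 GB into
the partition, which may help narrow down where the drive is misbehaving.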
next step: open chassis, change scsi ids, boot single user,
change fstab, if possible, make dumps of all the sd0 partitions.
RMA disk with ibm. sigh.
Sean.