Subject: how to mess up your week working around failing disk
To: None <current-users@netbsd.org>
From: Sean Doran <smd@ebone.net>
List: current-users
Date: 10/13/1999 06:22:34

fun with NetBSD system administration...

1. have an IBM DGHS09U develop bad blocks in the middle of the disk, but
   fortunately in the middle of an unimportant partition.  (one more use
   for many partitions; you can deploy empty/unused ones to contain clusters
   of errors).   have this situation last a long time.

2. observe scsictl reassign working, start reassigning blocks in the
   bad partition, which are discovered when reading back large (cda)
   files written into it.

3. grovel through ibm tech docs to see what it's complaining about when
   the reassign doesn't always work, or causes the drive to think for a
   long time.

   discover you probably have to replace your disk soonishly.

4. just as you are moving things around to take backups (while doing
   all sorts of work with pkgsrc/audio/lame), crash.   decide now is a
   good time to rearrange disks so that the two DGHS09Us aren't so close
   together, so as to avoid heating up.

   be annoyed when this takes much longer than expected thanks to poor
   design of PC gear, particularly in how things are mounted (grrrr)

5. boot.  bad DGHS09U makes upset noises and refuses to talk to you again,
   other than to complain about internal errors.   grr.

6. break out ancient (1.3) IDE drive and boot that (no floppy) 
   upgrade that to -current (thanks laptop!)
   rebuild system on 3 remaining 9gb scsi drives
   store away ide drive again

   note that in the presence of 3 scsi disks, and biosboot.sym
   from -current, with SCSI,A,C boot sequence and no floppy,
   "dev" shows only sd0:
   wd0: does nothing
   sd3: lets you at the contents of the ide drive (it's not atapi)
   
   i can't remember what happens when i set C,A,SCSI

7. have fun figuring out how to boot everything.  -current bootblocks
   seem unhappy with 4 scsi drives and one ide drive to choose from,
   but i can't really characterize that, since i got the ide drive
   out of the system fast.  with all 5 disks,  "ls" hangs, trying to boot
   locks things up to the point where the reset button needs pushing.

   boot from sd2.  sd2a is root partition.  mental note to self was: be sure
   to move sd2 to sd0 and sd0 furthest back in the scsi numbering.

8. rebuild everything...

9. observe that snapshot from XFree86 now does 1600x1200x24 on my screen
   which isn't supposed to be able to do that.  whee.  too bad my motherboard
   is ancient, slow, and has too little memory... :(

10. play play play play (insomnia)

11. system wedges up.  crunch.  hit reset button.  get distracted, expecting
    bootblocks to hang, not being able to talk to sd0.

12.  (&(*&^*&^#@$5  sd0 woke up.  how i discover this is observing
     fsck vomit over lots of things (IDUPs in particular), in sd0 (!!)
     and in a partition on sd2.

13. fsck everything.  even sd0 fscks clean after running through it a few
    times.

14. while in single user mode, start doing dumps of sd0 into files
    in big partitions on other disks.   yay, i can retrieve what's there.

    sd0a - no problem
    sd0b - gets well into Pass IV and then

panic: getblk: block size invariant failed
Stopped in dump at    Debugger+0x4:   leave
db> t/t  
[typed in manually]
Debugger(...)
panic(f0215260,fe56f038,fe537b40,ffffffff,0) at panic+0x55
getblk(fe56f038,71f100,4000,0,0) at getblk+0xec
bread(fe56038,71f100,4000,ffffffff,fe580e54) at bread+0x2d
spec_read(fe580e94,fe580ea8,f016d55e,fe580e94,fe580f88) at spec_read
ufsspec_read(damn rsi)
vn_read(ditto)
dofileread(ditto)
sys_read(ditto)
syscall()
--- syscall (number 3) ---
0x807dbaf:

    maybe this will dump core...


   my guess is that this is sd0 failing again.   anybody wanna have
me do anything before i type more at db> ?   i need an rsi break, so
if you're fast you can help improve NetBSD's resilience in pessimally
ugly i/o situations. -:)

   next step: open chassis, change scsi ids, boot single user,
change fstab, if possible, make dumps of all the sd0 partitions.
RMA disk with ibm.   sigh.
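
   the "change scsi ids ... change fstab" step can be sketched roughly
as follows.  this is only an illustration in python, not something that
was run on this machine: the unit map (old sd2 becomes sd0, old sd0 moves
back to an assumed sd3), the name remap_fstab and the sample fstab lines
are all hypothetical.

```python
import re

# Hypothetical renumbering: the old sd2 (root) becomes sd0 and the failing
# sd0 moves to the back. "sd3" is an assumed target, not taken from the post.
UNIT_MAP = {"sd2": "sd0", "sd0": "sd3"}

# Matches block or raw device nodes such as /dev/sd2a or /dev/rsd0e.
DEVICE_RE = re.compile(r"/dev/(r?)(sd\d+)([a-p])\b")


def remap_fstab(text, unit_map=UNIT_MAP):
    """Return fstab text with sd unit names renamed according to unit_map."""

    def rename(match):
        raw, unit, part = match.groups()
        # Each device is looked up exactly once, so a swap such as
        # sd2 -> sd0 and sd0 -> sd3 cannot clobber itself.
        return "/dev/" + raw + unit_map.get(unit, unit) + part

    out = []
    for line in text.splitlines(keepends=True):
        # Leave comment lines untouched.
        if line.lstrip().startswith("#"):
            out.append(line)
        else:
            out.append(DEVICE_RE.sub(rename, line))
    return "".join(out)


if __name__ == "__main__":
    # Made-up fstab contents, for illustration only.
    sample = (
        "/dev/sd2a / ffs rw 1 1\n"
        "/dev/sd2b none swap sw 0 0\n"
        "/dev/sd0e /scratch ffs rw 1 2\n"
        "/dev/sd1a /usr ffs rw 1 2\n"
    )
    print(remap_fstab(sample), end="")
```

   doing the rename in a single pass matters here: two sequential
search-and-replace runs (sd2 to sd0, then sd0 to sd3) would move the
new root entries to sd3 as well.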

	Sean.