Subject: adaptec 2940 disaster
To: None <port-i386@NetBSD.ORG>
From: Carl Shapiro <samsara@panix.com>
List: port-i386
Date: 10/04/1997 05:24:03
I was doing a *lot* of disk I/O and thrashing my CPU when suddenly my
console froze. Oh no...
I rebooted and started to fsck by hand. While fsck'ing / I got the following
error:
ahc1: ahc_scsi_cmd: more than 256 DMA segs
And then my console froze. So I rebooted. Bad stuff began to happen.
For some reason the 2940 couldn't do a mode sense with my Seagate
Barracuda (st1505n) and then claimed it had to use a "fictitious
geometry". Then when the machine was trying to mount root on sd0a I
got the following errors:
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0x47
sd0(ahc:1:1:0): asserted ATN - device reset in message buffer
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xb6
ahc1: Issued Channel A bus reset #1, 1 SCB aborted
It reported that mounting root failed with error 79, and tried to mount
root again... more errors:
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xe6
sd0(ahc:1:1:0): asserted ATN - device reset in message buffer
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0xfb
ahc1: Issued Channel A bus reset #1, 1 SCB aborted
sd0(ahc:1:1:0): timed out in datain phase, SCSISIGI == 0x0
ahc1: Issued Channel A bus reset #2, 1 SCB aborted
I tried mounting root again... more errors like the above. I tried
again and again and again and again. I then gave up, powered down,
and waited. This time the 2940 could sense the geometry of my drive,
and I was able to mount root. Unfortunately I had lost /bin/sh (turns
out that I lost all of /bin) and I couldn't boot single user mode.
I threw a spare drive (a Fujitsu Pico 7 M1606S) on my SCSI chain and
booted NetBSD from that. The machine booted flawlessly. However,
almost all of the filesystems on my Barracuda are completely blown away,
except for my tiny (16 meg) / which only lost /bin. When I tried to
fsck the /usr of the Barracuda (I used fsck -y because there were just
*so* many errors) I got the familiar:
ahc1: ahc_scsi_cmd: more than 256 DMA segs
But the machine didn't hang. Oh well.
Given that I have not touched anything inside of the ahc driver, what
could have caused this disaster? How can I prevent such a thing from
happening again? My system more or less looks like this:
Intel PR440FX Motherboard
Pentium Pro 200
64 megabytes RAM
on board Adaptec 2940UW (unused)
Adaptec 2940
on board Intel EtherExpress Pro 10/100B (also unused)
Are the bha or isp drivers more stable than the ahc on the i386? I wouldn't
mind dropping $200 for a more reliable system.
Carl