Subject: File System Corruption
To: None <port-alpha@netbsd.org>
From: Ray Phillips <r.phillips@mailbox.uq.edu.au>
List: port-alpha
Date: 01/09/2002 20:48:00
Dear NetBSD/alpha:
I have NetBSD/alpha version 1.5.2 running on a 3000/400 with the
system disk (the only one at the moment) mounted internally. About a
week after setting this machine up it crashed with messages like
these on its console:
asc0: STATUS_PHASE: msg 2
sd0(asc0:2:0): max sync rate 5.00MB/s
(asc0:2:0): selection failed; 3 left in FIFO [intr 18, stat 93, step 3]
sd0(asc0:2:0): asc0: timed out [ecb 0xfffffe000001e150 (flags 0x1,
dleft 2000, >
sd0(asc0:2:0): Check Condition on CDB: 0x0a 01 1b d0 10 00
SENSE KEY: Aborted Command
ASC/ASCQ: SCSI Parity Error
asc0: SCSI bus parity error
dev = 0x803, ino = 157, fs = /usr
panic: ifree: freeing free inode
Stopped in nmbd at cpu_Debugger+0x4: ret zero,(ra)
db>
Some, such as the first, were repeated *many* times. When I
rebooted, problems were found in its file system:
Automatic boot in progress: starting file system checks.
/dev/rsd0a: UNALLOCATED I=8299 OWNER=root MODE=0
/dev/rsd0a: SIZE=0 MTIME=Dec 24 18:00 2001
NAME=/var/log/messages.5.gz
/dev/rsd0a: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
Automatic file system check failed; help!
Dec 24 18:47:42 init: /bin/sh on /etc/rc terminated abnormally, going
to single user mode
Enter pathname of shell or RETURN for sh:
When I ran fsck_ffs on /dev/rsd0a and /dev/rsd0d, I told it to:
- correct all incorrect block counts it mentioned
- clear the files it said had unknown type
- fix files it said had bad type values
- remove files it said were unallocated
- reconnect directories it said were unref'ed, and
- adjust the link count for files it said had an incorrect value
There were many of each type of error. Luckily, the files it
suggested I remove were ones I could easily replace, mostly from the
NetBSD distribution. After this the machine booted normally, but
the following morning it had crashed again with the same
symptoms. I concluded that the SCSI controller for the internal bus must
be faulty and attached the system disk to the external bus. Agreed?
There've been no crashes in the week since then, so that seems
likely. I presume the internal SCSI controller chip is soldered to
the system board and hence not replaceable?
I tried to replace the system files I'd deleted with fsck by booting
from a NetBSD CD, running sh from sysinst's utility menu and then
mount /dev/cd0a /mnt2
mount /dev/sd0a /mnt
mount /dev/sd0d /mnt/usr
cd /mnt
pax -zrpe -f /mnt2/alpha/binary/sets/base.tgz
While pax was running it generated a few error messages. I can't
find them now to quote them verbatim, but they said some files couldn't
be extracted because something couldn't be unlinked. So it
seems there are still some errors in the file system. Is newfs'ing the
disk likely to be the only way to remove them?
By the way, why are sendmail and named included in the NetBSD
distribution? They're the only non-system programs that are, aren't
they? Using pax as above, I (carelessly, I'll admit) overwrote
version 8.12.1 of sendmail, which I'd previously installed.
Upgrading from one version of NetBSD to another would be simpler in
this case if sendmail and co. weren't in the way.
Ray