Subject: Re: File System Corruption
To: Ray Phillips <r.phillips@mailbox.uq.edu.au>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-alpha
Date: 01/09/2002 22:28:49
On Wed, Jan 09, 2002 at 08:48:00PM +1000, Ray Phillips wrote:
> Dear NetBSD/alpha:
>
> I have NetBSD/alpha version 1.5.2 running on a 3000/400 with the
> system disk (the only one at the moment) mounted internally. About a
> week after setting this machine up it crashed with messages like
> these on its console:
>
> asc0: STATUS_PHASE: msg 2
> sd0(asc0:2:0): max sync rate 5.00MB/s
> (asc0:2:0): selection failed; 3 left in FIFO [intr 18, stat 93, step 3]
> sd0(asc0:2:0): asc0: timed out [ecb 0xfffffe000001e150 (flags 0x1,
> dleft 2000, >
> sd0(asc0:2:0): Check Condition on CDB: 0x0a 01 1b d0 10 00
> SENSE KEY: Aborted Command
> ASC/ASCQ: SCSI Parity Error
> asc0: SCSI bus parity error
> dev = 0x803, ino = 157, fs = /usr
> panic: ifree: freeing free inode
> Stopped in nmbd at cpu_Debugger+0x4: ret zero,(ra)
> db>
>
> Some, such as the first, were repeated *many* times. When I
> rebooted, problems were found in its file system:
>
> Automatic boot in progress: starting file system checks.
> /dev/rsd0a: UNALLOCATED I=8299 OWNER=root MODE=0
> /dev/rsd0a: SIZE=0 MTIME=Dec 24 18:00 2001
> NAME=/var/log/messages.5.gz
>
> /dev/rsd0a: UNEXPECTED INCONSISTENCY; RUN fsck_ffs MANUALLY.
> Automatic file system check failed; help!
> Dec 24 18:47:42 init: /bin/sh on /etc/rc terminated abnormally, going
> to singlee
> Enter pathname of shell or RETURN for sh:
>
> When I ran fsck_ffs on /dev/rsd0a and /dev/rsd0d I told it to:
> - correct all incorrect block counts it mentioned
> - clear the files it said had unknown type
> - fix files it said had bad type values
> - remove files it said were unallocated
> - reconnect directories it said were unref'ed, and
> - adjust the link count for files it said had an incorrect value
>
> There were many of each type of error. Luckily the files it
> suggested I remove were ones I could easily replace--mostly from the
> NetBSD distribution. After this, the machine booted normally, but
> the following morning it had crashed again with the same
> symptoms. I concluded the SCSI controller for the internal bus must
> be faulty and attached the system disk to the external bus. Agreed?
> There've been no crashes in the week since then, so that seems
> likely. I presume the internal SCSI controller chip is soldered to
> the system board and hence not replaceable?
The parity error would point to a problem between the SCSI
chip and the SCSI connector, so I'm not sure replacing the SCSI chip
will solve it.
> [...]
> When pax was running it generated a few error messages, which I can't
> find now and can't quote verbatim, but they mentioned not being able
> to extract some files because something couldn't be unlinked. So, it
> seems there are still some errors in the file system. Is it likely
> the only way to remove them is to newfs the disk?
No, this is probably caused by something else, such as some files
having gained a file flag. Try ls -lo on these files.
If there are a lot of them, newfs may be the faster way of solving it, though.
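
If there are too many files to check by eye with ls -lo, something along
these lines could list the affected ones. This is only a rough sketch, not
a tested tool: the helper names (blocking_flags, scan) are made up for
illustration, and it assumes the unlink failures come from the immutable or
append-only flags (shown as schg, uchg, sappnd, uappnd by ls -lo). On
systems without BSD file flags the st_flags field is missing, so nothing is
reported.

```python
import os
import stat

# BSD file flags that prevent a file from being removed,
# keyed by the names that "ls -lo" prints for them.
BLOCKING_FLAGS = {
    "uchg": stat.UF_IMMUTABLE,
    "schg": stat.SF_IMMUTABLE,
    "uappnd": stat.UF_APPEND,
    "sappnd": stat.SF_APPEND,
}


def blocking_flags(path):
    """Return the names of flags set on path that would block unlink."""
    st = os.lstat(path)
    # st_flags only exists where the OS supports BSD file flags.
    flags = getattr(st, "st_flags", 0)
    return [name for name, bit in BLOCKING_FLAGS.items() if flags & bit]


def scan(root):
    """Yield (path, flag names) for each entry under root with such a flag."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            found = blocking_flags(path)
            if found:
                yield path, found


if __name__ == "__main__":
    for path, found in scan("/usr"):
        print(path, ",".join(found))
```

Any file it reports can then be looked at with ls -lo, and the flag cleared
with chflags (for example "chflags nouchg file"). The system flags can
normally only be cleared at a low securelevel, e.g. in single-user mode.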
--
Manuel Bouyer <bouyer@antioche.eu.org>
--