Subject: WD_SOFTBADSECT, WD_QUIRK_FORCE_LBA48 (improving the robustness of the IDE
To: port-i386@netbsd.org, tech-kern@netbsd.org
From: davef1624@aol.com
List: tech-kern
Date: 10/05/2005 13:30:32
We are seeing several apparent reliability issues with the IDE drives
we're using.
Some of the drives develop a bad sector/block after only about 5,000 to
10,000 hours of operation.
In addition, the drive sometimes cannot spare out the bad block.
When we run the 'smartmon' diagnostics on the disk, they usually pass
the health check but fail the extended diagnostics (usually because of
repeated read errors from the disk).
Also, fsck and other system processes repeatedly retry reading
and/or writing these bad blocks:
>kernel: pciide0:1:0: device timeout, c_bcount=8192, c_skip0
>kernel: pciide0 channel 1: reset failed for drive 0
>kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
>kernel: pciide0:1:0: not ready, st=0x80, err=0x00
>kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
>kernel: wd0: soft error (corrected)
>kernel: pciide0:1:0: bus-master DMA error: missing interrupt, status=0x21
>kernel: pciide0:1:0: device timeout, c_bcount=65536, c_skip0
>kernel: wd0a: device timeout reading fsbn 8343104 of 8343104-8343231 (wd0 bn 8343104; cn 8276 tn 14 sn 14), retrying
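As a sanity check on the log above: the cn/tn/sn values are consistent with a
translated geometry of 16 heads and 63 sectors per track, and each block range
matches the c_bcount of the timed-out transfer at 512 bytes per block. The
geometry is an assumption inferred from the numbers in the log, not something
read from the drive, and the helper names below are made up for illustration.
A minimal sketch in C:

```c
#include <stdio.h>

/* Assumed geometry, inferred from the log lines (not queried from the drive). */
#define HEADS             16
#define SECTORS_PER_TRACK 63
#define BLOCK_SIZE        512

/* Hypothetical helper type: cylinder, track (head) and 0-based sector. */
struct chs {
    unsigned long cn;
    unsigned int  tn;
    unsigned int  sn;
};

/* Split an absolute block number into cn/tn/sn under the assumed geometry. */
static struct chs bn_to_chs(unsigned long bn)
{
    struct chs r;

    r.sn = (unsigned int)(bn % SECTORS_PER_TRACK);
    r.tn = (unsigned int)((bn / SECTORS_PER_TRACK) % HEADS);
    r.cn = bn / ((unsigned long)SECTORS_PER_TRACK * HEADS);
    return r;
}

/* Number of bytes covered by the inclusive block range first..last. */
static unsigned long range_bytes(unsigned long first, unsigned long last)
{
    return (last - first + 1) * BLOCK_SIZE;
}

/* Print one decoded log entry in roughly the kernel's format. */
static void print_entry(unsigned long first, unsigned long last)
{
    struct chs c = bn_to_chs(first);

    printf("bn %lu; cn %lu tn %u sn %u (%lu bytes)\n",
        first, c.cn, c.tn, c.sn, range_bytes(first, last));
}
```

Calling print_entry(8288336, 8288351) and print_entry(8343104, 8343231)
reproduces the cn/tn/sn triples from the log, with transfer sizes of 8192 and
65536 bytes respectively, so both failures are ordinary full-size reads that
timed out rather than partial transfers.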
Therefore, I'm looking into any critical fixes that would improve our
system's resiliency to these kinds of errors;
our system needs to be as robust as possible.
There appear to be several alternatives:
1) Use the WD_SOFTBADSECT 'automatic bad-sector list' fix - introduced
on Apr 15, 2003
(Revision 1.241 of wd.c).
My question concerns the following (taken from the wd(4) man page):
> This feature does not interoperate well with the sector remapping
> features of modern disks.
> To let the disk remap a sector internally, the software bad sector
> list must be flushed or disabled before.
Can anyone further explain this to me? How would I remap a bad sector
when using WD_SOFTBADSECT?
I'd like to avoid having to reboot if possible.
2) Use the WD_QUIRK_FORCE_LBA48 feature. Can anyone explain this
feature to me as well?
3) Use RAIDframe for data mirroring; we only have one physical drive
in the system, though.
Is it possible to use RAIDframe to mirror data across two separate
file-system partitions on the same drive?
This would help protect us from bad disk blocks on an otherwise
working drive.
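To make option 1 concrete, here is a rough model of a software bad-sector
list as I understand it from the description (no repeated reads of blocks
that have already failed). This is an illustrative sketch in C, not the code
in wd.c: the names badlist, badlist_add, badlist_hit and badlist_flush, and
the fixed capacity, are assumptions made for the example.

```c
#include <stddef.h>

#define BADLIST_MAX 128   /* arbitrary capacity chosen for this sketch */

/* One inclusive range of blocks that failed with an unrecoverable error. */
struct badrange {
    unsigned long first;
    unsigned long last;
};

/* Hypothetical in-memory list; the real driver's layout may differ. */
struct badlist {
    struct badrange r[BADLIST_MAX];
    size_t n;
};

/* Record a failed range. Returns 0 on success, -1 if the list is full. */
static int badlist_add(struct badlist *bl, unsigned long first,
    unsigned long last)
{
    if (bl->n >= BADLIST_MAX)
        return -1;
    bl->r[bl->n].first = first;
    bl->r[bl->n].last = last;
    bl->n++;
    return 0;
}

/*
 * Return 1 if the request covering nblks blocks starting at blkno overlaps
 * any recorded range, in which case the request would be failed at once
 * instead of being sent to the drive and retried.
 */
static int badlist_hit(const struct badlist *bl, unsigned long blkno,
    unsigned long nblks)
{
    unsigned long end;
    size_t i;

    if (nblks == 0)
        return 0;
    end = blkno + nblks - 1;
    for (i = 0; i < bl->n; i++) {
        if (blkno <= bl->r[i].last && end >= bl->r[i].first)
            return 1;
    }
    return 0;
}

/* Forget all recorded ranges so later requests reach the drive again. */
static void badlist_flush(struct badlist *bl)
{
    bl->n = 0;
}
```

If this model is right, it would explain the man-page warning: while a range
is on the list, requests to it are rejected in software and never reach the
drive, so the drive gets no opportunity to remap the sector until the list is
flushed or the feature is disabled.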
Thanks again for your help,
Dave
-----Original Message-----
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: davef1624@aol.com
Cc: port-i386@NetBSD.org; tech-kern@NetBSD.org
Sent: Wed, 28 Sep 2005 19:37:43 +0200
Subject: Re: WD_SOFTBADSECT usage ?
On Wed, Sep 28, 2005 at 01:52:27AM -0400, davef1624@aol.com wrote:
>
> We're currently using a fairly 'old' wd.c driver & 1.6 NetBSD kernel
> -- from Nov 1, 2002 to be exact.
>
> I'm wondering if there are any critical bug fixes (to either wd.c,
> ata*, pciide* drivers) that might impact
> disk driver/subsystem reliability and/or error recovery since this
> date?
Probably, but if you don't have problems, I'm not sure why you worry :)
>
> One fix that I noticed was the WD_SOFTBADSECT automatic bad-sector
> list management on Apr 15, 2003
> (Revision 1.241 of wd.c).
>
> This fix appears to improve the error recovery of the disk driver by
> not attempting *repeated* reads
> on failed (unrecoverable) disk blocks.
>
> What are the tradeoffs here? Can I safely turn on this feature?
Probably, as long as you're aware what you need to do to remap a bad
sector.
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference