tech-kern: WD_SOFTBADSECT, WD_QUIRK_FORCE_LBA48 (improving the robustness of the IDE

Subject: WD_SOFTBADSECT, WD_QUIRK_FORCE_LBA48 (improving the robustness of the IDE
To: None <port-i386@netbsd.org, tech-kern@netbsd.org>
From: None <davef1624@aol.com>
List: tech-kern
Date: 10/05/2005 13:30:32
 We are seeing several apparent reliability issues with the IDE drives
 we're using.
 Some of the drives develop a bad sector/block after only about 5,000 to
 10,000 hours of operation.
 In addition, the drive sometimes cannot spare out (remap) the bad block.
 When we run the 'smartmon' diagnostics on a disk, it usually passes the
 Health Check but fails the extended diagnostics (usually because of
 repeated read errors from the disk).

 Also, fsck and other system processes will repeatedly retry reading
 and/or writing these bad blocks:

 >kernel: pciide0:1:0: device timeout, c_bcount=8192, c_skip0
 >kernel: pciide0 channel 1: reset failed for drive 0
 >kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
 >kernel: pciide0:1:0: not ready, st=0x80, err=0x00
 >kernel: wd0a: device timeout reading fsbn 8288336 of 8288336-8288351 (wd0 bn 8288336; cn 8222 tn 8 sn 56), retrying
 >kernel: wd0: soft error (corrected)
 >kernel: pciide0:1:0: bus-master DMA error: missing interrupt, status=0x21
 >kernel: pciide0:1:0: device timeout, c_bcount=65536, c_skip0
 >kernel: wd0a: device timeout reading fsbn 8343104 of 8343104-8343231 (wd0 bn 8343104; cn 8276 tn 14 sn 14), retrying
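 As a reading aid for the log above, here is a small C sketch of the
 arithmetic behind the "bn ...; cn ... tn ... sn ..." fields and the
 c_bcount values. The geometry (16 heads, 63 sectors per track) and the
 512-byte sector size are assumptions: they do not appear in the log, but
 they are the values for which the logged cn/tn/sn reproduce the logged
 bn. The function and macro names are hypothetical, not taken from wd.c.

```c
#include <stdint.h>

/* Assumed translated geometry and sector size (not stated in the log). */
#define HEADS             16u
#define SECTORS_PER_TRACK 63u
#define SECTOR_SIZE       512u

/*
 * Convert cylinder/track(head)/sector to an absolute block number.
 * The sector number is treated as zero-based, which is what makes
 * the logged values line up.
 */
static uint32_t chs_to_bn(uint32_t cn, uint32_t tn, uint32_t sn)
{
    return (cn * HEADS + tn) * SECTORS_PER_TRACK + sn;
}

/* Number of sectors covered by a transfer of bcount bytes. */
static uint32_t sectors_in_transfer(uint32_t bcount)
{
    return bcount / SECTOR_SIZE;
}
```

 Under these assumptions, chs_to_bn(8222, 8, 56) gives 8288336 and
 chs_to_bn(8276, 14, 14) gives 8343104, matching the two "wd0 bn" values.
 Likewise c_bcount=8192 is 16 sectors (8288336-8288351) and
 c_bcount=65536 is 128 sectors (8343104-8343231). The fsbn and bn values
 being equal suggests that wd0a starts at the beginning of the disk.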

 I'm therefore looking for any critical fixes that would improve our
 system's resilience to these kinds of errors; the system needs to be as
 robust as possible.

 There appear to be several alternatives:

 1) Use the WD_SOFTBADSECT 'automatic bad-sector list' fix, introduced
 on Apr 15, 2003 (Revision 1.241 of wd.c).
 My question concerns the following (taken from the wd(4) man page):

 > This feature does not interoperate well with the sector remapping
 > features of modern disks.
 > To let the disk remap a sector internally, the software bad sector
 > list must be flushed or disabled before.

 Can anyone further explain this to me?  How would I remap a bad sector
 when using WD_SOFTBADSECT?
 I'd like to avoid having to reboot if possible.

 2) Use the WD_QUIRK_FORCE_LBA48 feature. Can anyone explain this
 feature to me as well?

 3) Use RAIDframe for data mirroring; we only have one physical drive
 in the system, though.
 Is it possible to use RAID to mirror data onto two separate file-system
 partitions on the same drive?
 This would help protect us from bad disk blocks on an otherwise
 working drive.
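
 To make the question in 1) concrete, here is a minimal C sketch of what
 a software bad-sector list does in general: failed block ranges are
 remembered, later requests that overlap a remembered range are failed
 at once instead of being retried, and the list has to be flushed before
 a write can reach the drive and give it a chance to remap the sector.
 This illustrates the idea only; it is not the code from wd.c, and every
 name in it is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* One remembered failure: inclusive range of block numbers. */
struct bad_range {
    uint64_t start;
    uint64_t end;
    struct bad_range *next;
};

/* Hypothetical per-disk list of known-bad ranges. */
struct bad_list {
    struct bad_range *head;
    unsigned count;
};

/* Record a failed range. Returns 0 on success, -1 if out of memory. */
static int bad_list_add(struct bad_list *bl, uint64_t start, uint64_t end)
{
    struct bad_range *r = malloc(sizeof(*r));
    if (r == NULL)
        return -1;
    r->start = start;
    r->end = end;
    r->next = bl->head;
    bl->head = r;
    bl->count++;
    return 0;
}

/* True if the request [blkno, blkno + nblks - 1] touches a known-bad range. */
static bool bad_list_overlaps(const struct bad_list *bl,
                              uint64_t blkno, uint64_t nblks)
{
    if (nblks == 0)
        return false;
    uint64_t last = blkno + nblks - 1;
    for (const struct bad_range *r = bl->head; r != NULL; r = r->next) {
        if (r->start <= last && r->end >= blkno)
            return true;   /* caller would fail the I/O without retrying */
    }
    return false;
}

/* Forget all remembered failures so requests reach the drive again. */
static void bad_list_flush(struct bad_list *bl)
{
    while (bl->head != NULL) {
        struct bad_range *r = bl->head;
        bl->head = r->next;
        free(r);
    }
    bl->count = 0;
}
```

 In this sketch, once 8288336-8288351 has been added, any request
 touching those blocks is rejected up front, which is also why a write
 intended to trigger the drive's own remapping never gets there until
 bad_list_flush() has been called.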

 Thanks again for your help,
 Dave

 -----Original Message-----
 From: Manuel Bouyer <bouyer@antioche.eu.org>
 To: davef1624@aol.com
 Cc: port-i386@NetBSD.org; tech-kern@NetBSD.org
 Sent: Wed, 28 Sep 2005 19:37:43 +0200
 Subject: Re: WD_SOFTBADSECT usage ?

 On Wed, Sep 28, 2005 at 01:52:27AM -0400, davef1624@aol.com wrote:
 >
 > We're currently using a fairly 'old' wd.c driver & 1.6 NetBSD kernel --
 > from Nov 1, 2002 to be exact.
 >
 > I'm wondering if there are any critical bug fixes (to either wd.c,
 > ata*, pciide* drivers) that might impact
 > disk driver/subsystem reliability and/or error recovery since this
 > date?

 Probably, but if you don't have problems, I'm not sure why you worry :)

 >
 > One fix that I noticed was the WD_SOFTBADSECT automatic bad-sector list
 > management on Apr 15, 2003
 > (Revision 1.241 of wd.c).
 >
 > This fix appears to improve the error recovery of the disk driver by
 > not attempting *repeated* reads
 > on failed (unrecoverable) disk blocks.
 >
 > What are the tradeoffs here? Can I safely turn on this feature?

 Probably, as long as you're aware of what you need to do to remap a bad
 sector.

 --
 Manuel Bouyer <bouyer@antioche.eu.org>
 NetBSD: 26 ans d'experience feront toujours la difference
 --