netbsd-help: Disk drivers and data errors

Subject: Disk drivers and data errors
To: None <netbsd-help@netbsd.org>
From: Stephen Borrill <netbsd@precedence.co.uk>
List: netbsd-help
Date: 07/26/2005 12:30:53

You may recall my previous worries about a large number of disk failures
we've had recently. These have been particularly highlighted by using
RAIDframe on them (read/write errors cause potential loss of a filesystem,
not just a file). All these failures have been on one model of machine
(we've supplied a number of types of machines in the past and have never
had any problems like this) and a couple of different models of drives.
This has been seen on both 2.0_STABLE and 1.6.2_STABLE (we added the
relevant recognition lines to pciide.c on 1.6). Relevant bits of dmesg:

piixide0 at pci0 dev 31 function 2
piixide0: Intel 82801EB Serial ATA Controller (rev. 0x02)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at irq 14
wd0 at atabus0 drive 0: <Maxtor 7Y250M0>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 233 GB, 486344 cyl, 16 head, 63 sec, 512 bytes/sect x 490234752 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1 at atabus0 drive 1: <Maxtor 7Y250M0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 233 GB, 486344 cyl, 16 head, 63 sec, 512 bytes/sect x 490234752 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)

We've also seen the same with Maxtor 6Y080M0 disks. As we've only seen
this problem with the combination of Intel 82801EB controllers,
7Y250M0/6Y080M0 drives and RAIDframe, I'm sure one of those must be at
fault. It _could_ be a dodgy batch of drives, but these machines have
been purchased over quite a long period and with different drive models.
I'm wondering whether these are phantom errors caused by a buggy driver
or a missing quirk.

Example errors:

wd0e: error reading fsbn 332551232 of 332551232-332551295 (wd0 bn
336682016; cn 334009 tn 14 sn 62), retrying
wd0: (uncorrectable data error)
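
For anyone wanting to sanity-check the numbers in that message, here is a
minimal sketch. It assumes cn/tn/sn are derived from the absolute block
number (the "wd0 bn" value) using the geometry from the dmesg output above
(16 heads, 63 sectors per track), with zero-based sector numbers.
block_to_chs is a hypothetical helper written for this mail, not anything
taken from the kernel source.

```python
# Geometry as reported in the dmesg lines for wd0/wd1.
HEADS = 16
SECTORS_PER_TRACK = 63


def block_to_chs(block, heads=HEADS, sectors=SECTORS_PER_TRACK):
    """Split an absolute block number into (cylinder, track, sector).

    Assumption: sector numbers are zero-based, i.e. 0..sectors-1.
    """
    sector = block % sectors
    track = (block // sectors) % heads
    cylinder = block // (sectors * heads)
    return cylinder, track, sector


if __name__ == "__main__":
    # "wd0 bn 336682016" from the example error.
    print(block_to_chs(336682016))  # (334009, 14, 62)
```

Under those assumptions the block number works out to cylinder 334009,
track 14, sector 62, so the reported location is at least self-consistent
with the reported geometry.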

These errors aren't at random addresses (i.e. they are consistent per
machine), but they differ from machine to machine (i.e. it's not some odd
address-related fault). We've also seen "address mark not found" errors.
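
One way to check the "consistent per machine" observation is to tally the
failing block numbers from each machine's kernel log. The following is
only a sketch: the regular expression is modelled on the single example
error above (and assumes write errors use the same wording), and
tally_errors is a hypothetical name.

```python
import re
from collections import Counter

# Assumed log format, modelled on the one example error in this mail.
ERROR_RE = re.compile(
    r"(?P<part>wd\d+[a-z]): error (?:reading|writing) fsbn (?P<fsbn>\d+)"
)


def tally_errors(lines):
    """Count how often each (partition, fsbn) pair is reported."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            key = (match.group("part"), int(match.group("fsbn")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # The first line is the real example; the repeat is only to
    # illustrate counting, it is not taken from an actual log.
    sample = [
        "wd0e: error reading fsbn 332551232 of 332551232-332551295",
        "wd0e: error reading fsbn 332551232 of 332551232-332551295",
        "wd0: (uncorrectable data error)",
    ]
    for (part, fsbn), n in tally_errors(sample).most_common():
        print(part, fsbn, n)
```

A handful of addresses with high counts on one machine, and a different
handful on the next, would match what we are seeing.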

smartd says things like:
server smartd[295]: Device: /dev/wd1d, 110 Currently unreadable (pending) sectors
server smartd[295]: Device: /dev/wd1d, 110 Offline uncorrectable sectors

P.S. When trying to reconstruct a RAID 1 array onto a failed component,
why should it panic if unable to write:

raid0: initiating in-place reconstruction on column 0
raid0: Recon write failed!
panic: raidframe error at line 880 file
/usr/work/netmanager/netbsd/usr/src/sys/arch/i386/compile/NETMANRAID/../../../../dev/raidframe/rf_reconstruct.c

Any thoughts appreciated,

--
Stephen