NetBSD-Users archive


diagnosis for disk drive errors (zfs on cgd on sata disk)



After a recent drive failure in my primary zfs pool, I set
up a secondary pool on a cgd(4) device on a single new sata
hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
hdd) to back up the primary.

I initially scrubbed the entire disk without apparent
incident using a temporary cryptographic device and dd(1),
as described in the cgdconfig(8) man page.

Since then, the drive has failed in the same way and been
detached twice in the past two days: once on the very first
zfs(8) create operation, and again (after a reboot) after
hundreds of GiB had been written to it through a zfs(8)
send/receive pipe.  Here are the relevant system messages:

# dmesg
...
[ 57131.573806] mpii0: physical device removed from slot 7
[ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
[ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
[ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
[ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
[ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
[ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
[ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
[ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
[ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
[ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
[ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
[ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
[ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
[ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
[ 57131.643840] dk11 at sd7 (backupcgd0) deleted
[ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
[ 57131.643840] sd7: detached
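
One thing that can be read off these messages is whether the
sd7 and cgd0 errors are the same failed requests reported at
two layers.  The following is a minimal sketch, assuming
Python 3 is at hand; the function name best_offset and the
pairing heuristic (same operation, same length) are made up
for illustration and are not part of any NetBSD tool.  It
pairs up the failed requests and reports the most common
difference between their starting block numbers:

```python
import re
from collections import Counter

# Error lines copied from the dmesg output above (timestamps dropped).
LOG = """\
sd7d: error writing fsbn 1816866262 of 1816866262-1816866389
cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205
sd7d: error reading fsbn 270904 of 270904-270919
sd7d: error reading fsbn 7814028344 of 7814028344-7814028359
sd7d: error reading fsbn 7814028856 of 7814028856-7814028871
cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687
cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175
cgd0d: error reading fsbn 8720 of 8720-8735
sd7d: error writing fsbn 1816866646 of 1816866646-1816866773
cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589
sd7d: error writing fsbn 1816866518 of 1816866518-1816866645
cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461
"""

PATTERN = re.compile(
    r"(sd7|cgd0)d: error (reading|writing) fsbn \d+ of (\d+)-(\d+)"
)


def best_offset(log):
    """Return (offset, count) for the most common sd7 - cgd0 start offset.

    Heuristic (an assumption): a cgd0 request and an sd7 request are
    candidates for being the same I/O if the operation and the length
    in blocks both match.
    """
    reqs = {"sd7": [], "cgd0": []}
    for dev, op, first, last in PATTERN.findall(log):
        first, last = int(first), int(last)
        reqs[dev].append((op, first, last - first + 1))

    diffs = Counter()
    for s_op, s_first, s_len in reqs["sd7"]:
        for c_op, c_first, c_len in reqs["cgd0"]:
            if (s_op, s_len) == (c_op, c_len):
                diffs[s_first - c_first] += 1
    return diffs.most_common(1)[0]


if __name__ == "__main__":
    offset, count = best_offset(LOG)
    print(f"offset {offset} blocks matches {count} request pairs")
```

On the lines above this gives an offset of 262184 blocks for
all six cgd0 requests, so every cgd0 error has a matching
sd7 error at a fixed distance.  That is consistent with
(though it does not prove) the cgd0 errors simply being the
sd7 failures propagating upwards, with the wedge holding
cgd0 starting 262184 blocks into sd7.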

I don't know how to go about diagnosing the issue and would
appreciate any suggestions.  In particular, the hdd is new
and I wonder if I should return it for a replacement.  The
previous disk in the same bay had also been showing
read/write errors (the other drive never got detached,
though).

Apart from the drive, I also have little faith in the
backplane, cables, SAS controller (which I reflashed), RAM,
etc., although here it looks to me like the problem could
be somewhere between the drive and the controller.

Many thanks,
Pouya

N.B. I'm also a bit confused by how zfs is handling this:
zpool(8) appears to think the drive is still online, while
zfs(8) doesn't list any datasets on it:

# zpool status -v puddle
  pool: puddle
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: none requested
config:

	NAME              STATE     READ WRITE CKSUM
	puddle            ONLINE       0 3.62K     0
	  wedges/backup0  ONLINE       0   213     0

errors: Permanent errors have been detected in the following files:

        puddle/backup.pond/backup:<0x0>
        puddle/backup.pond/backup:<0x10ecc5>

# zfs list puddle
cannot open 'puddle': pool I/O is currently suspended

