Matthias Petermann <mp%petermann-it.de@localhost> writes:

> I run a NetBSD-based NAS at home. It is currently running on NetBSD 9.1.

Probably you should bring it forward along netbsd-9, but that's likely
unrelated.

> The system is booted from a USB stick on which the root file system is
> also located. The storage is on 4 x 4 TB magnetic hard disks, configured
> as ZFS RAIDZ2.
>
> Earlier I noticed that the I/O performance of the system suddenly
> collapsed drastically. A look at the syslog gives a pretty clear
> indication of the reason:
>
> [ 87240.313853] wd2: (uncorrectable data error)
> [ 87240.313853] wd2d: error reading fsbn 5707914328 of 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
> [ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
> [ 87465.637977] wd2: (uncorrectable data error)
> [ 87475.561683] wd2: soft error (corrected) xfer 338
> [ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
> [ 87506.393194] wd2: (uncorrectable data error)
> [ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1

You seem to be having both correctable and uncorrectable errors.

> The whole syslog is full of these messages. What surprises me is that
> there are "uncorrectable" data errors in the syslog. Nevertheless, the

Why? These messages are the OS reading a block from wd2 and getting a
notification from the controller that the block could not be read. This
happens as disks become troubled, and I've seen it often over the years
(over many systems; it's not often on any given system).

> data can still be read - albeit very slowly. My assumption was that the

You have to separate "can be read from wd2" and "can be read from the
zfs raidz2".

> redundancies of RAID2 are being used to compensate for the defects.
> To my surprise, ZFS does not seem to have noticed any of these defects:

I think you may have uncovered a bug in zfs statistics.

>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz2-0  ONLINE       0     0     0
>             dk0     ONLINE       0     0     0
>             dk1     ONLINE       0     0     0
>             dk2     ONLINE       0     0     0
>             dk3     ONLINE       0     0     0

It really seems like dk2 (assuming dk2 == wd2) should have some read
errors.

> Another indication that ZFS has not yet noticed the error: with top,
> there is no significant CPU load during I/O, neither in the user nor
> the system area. I would have expected this at least in the case when
> ZFS works with redundancies.

It's more or less xor for raidz1, so compared to disk read times, I'd
expect no real cpu hit. I am unclear on raidz2, but surely it's not
public key crypto. The corresponding operation is already being done on
every write to create the redundant bits.

This may be slightly helpful, merely interesting, or neither:

  https://queue.acm.org/detail.cfm?id=1670144

> So it looks like the hardware error can still be corrected as far as
> possible at the level of the device driver, which makes me doubt the
> truth of the statement "uncorrectable data error".

What I do for each of my (physical) disks, spinning and ssd, is run
(x86-centric; the "c" partition on other ports), once every few months:

  dd if=/dev/rwd0d of=/dev/null bs=1m

and see if that throws any errors. If there is one, I try to read that
block a few times, and generally will then 1) take that as a sign to
replace the disk (or move it to an nth external backup) and 2) write
that sector, so that it gets reallocated. If the disk is part of a
raid1, I can write it with good data; if not, I write zeros and fsck. I
am a big fan of replacing disks that show errors, but sometimes one
can't, and that's my workaround.

> Does anyone know what would have to happen for ZFS to notice the
> hardware defect?

I bet zfs got a failed read and did the reconstruction, but didn't log
it. That's just a guess, though, and it would be a good thing to figure
out.
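If it helps, here is roughly how I'd script that periodic read check (a
sketch only -- the device list and messages are mine, not anything
standard; adjust the names to your own disks, and remember you need
read access to the raw devices):

```shell
# Sketch of the periodic whole-disk read check described above.
# The device list is an assumption -- substitute your own wd*/sd*
# disks (raw "d" partition on x86 NetBSD, "c" on most other ports).
scan_disk() {
    # Read every block; dd exits non-zero if the driver reports a
    # read error anywhere on the device.
    if dd if="$1" of=/dev/null bs=1048576 2>/dev/null; then
        echo "$1: ok"
    else
        echo "$1: read errors -- time to think about replacing it"
    fi
}

for disk in /dev/rwd0d /dev/rwd1d /dev/rwd2d; do
    if [ -e "$disk" ]; then
        scan_disk "$disk"
    fi
done
```

Run from cron every few months, any "read errors" line is your cue to
retry the block, rewrite it, and start shopping for a replacement.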
> saturn$ doas atactl wd2 smart status
> SMART supported, SMART enabled
> id value thresh crit collect reliability description                raw
>   1   197     51 yes  online  positive  Raw read error rate        38669
>   3   176     21 yes  online  positive  Spin-up time                6158
>   4   100      0 no   online  positive  Start/stop count             510
>   5   200    140 yes  online  positive  Reallocated sector count       0
>   7   200      0 no   online  positive  Seek error rate                0
>   9    64      0 no   online  positive  Power-on hours count       26740
>  10   100      0 no   online  positive  Spin retry count               0
>  11   100      0 no   online  positive  Calibration retry count        0
>  12   100      0 no   online  positive  Device power cycle count     506
> 192   200      0 no   online  positive  Power-off retract count       99
> 193   200      0 no   online  positive  Load cycle count            2672
> 194   117      0 no   online  positive  Temperature                   33
> 196   200      0 no   online  positive  Reallocated event count        0
> 197   200      0 no   online  positive  Current pending sector        18

This is the big deal. The drive has decided that 18 sectors are not ok.
It will reallocate them when they are written, but reads of them are
returned as uncorrectable so that the data loss is not silent from the
OS's point of view.

> 198   100      0 no   offline positive  Offline uncorrectable          0
> 199   200      0 no   online  positive  Ultra DMA CRC error count      0
> 200   100      0 no   offline positive  Write error rate               0

Probably if you take that drive out, put it in a test box, write zeros
to the whole drive, and then read it back, it will be sort of ok, but I
wouldn't trust it.
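As an aside, that attribute is easy enough to watch from a cron job. A
minimal sketch of pulling the raw value out of the atactl output -- the
awk expression assumes the raw value is the last column, as in your
listing, and for illustration it parses a line copied from your mail
rather than making a live atactl call:

```shell
# Extract the raw "Current pending sector" count from atactl SMART
# output. The sample line is copied from the listing above; on a live
# system you would instead pipe in:
#   doas atactl wd2 smart status
smart_line='197   200      0 no   online  positive  Current pending sector        18'

pending=$(printf '%s\n' "$smart_line" | awk '/Current pending sector/ { print $NF }')
echo "pending sectors: $pending"
```

Anything above zero, or a count that keeps growing between checks, is a
good reason to schedule the replacement sooner rather than later.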