NetBSD-Users archive
Re: ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?
On Wed, 14 Jul 2021 at 12:07, Matthias Petermann <mp%petermann-it.de@localhost> wrote:
>
> Hello all,
>
>
> ```
> [ 87240.313853] wd2: (uncorrectable data error)
> [ 87240.313853] wd2d: error reading fsbn 5707914328 of
> 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
> [ 87465.637977] wd2d: error reading fsbn 5710464152 of
> 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer
> 338, retry 0
> [ 87465.637977] wd2: (uncorrectable data error)
> [ 87475.561683] wd2: soft error (corrected) xfer 338
> [ 87506.393194] wd2d: error reading fsbn 5710555128 of
> 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer
> 40, retry 0
> [ 87506.393194] wd2: (uncorrectable data error)
> [ 87515.156465] wd2d: error reading fsbn 5710555128 of
> 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer
> 40, retry 1
> ```
>
> The whole syslog is full of these messages. What surprises me is that
> there are "uncorrectable" data errors in the syslog. Nevertheless, the
> data can still be read - albeit very slowly. My assumption was that the
> redundancies of RAID2 are being used to compensate for the defects. To
> my surprise, ZFS does not seem to have noticed any of these defects:
>
The wd driver is retrying (IIRC it retries 3 times) and succeeding on
the second or third attempt. See xfer 338, retry 0, followed by a 'soft
error (corrected)' with the same xfer number 10 seconds later: this is
the retry succeeding.
The retry logic sits below ZFS, so ZFS never sees the error. If the
read failed all 3 times you'd probably get a data error in ZFS.
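To check this pairing across a whole syslog rather than by eye, something like the following rough sketch might help. It assumes the wrapped kernel messages have been joined back onto single lines and that an xfer number identifies the same transfer within the excerpt, which I have not verified against the driver source. The function name `classify_xfers` and the `recovered`/`unresolved` labels are made up for illustration.

```python
import re

# A failed read attempt, as logged by wd: "... xfer 338, retry 0"
FAILED = re.compile(r"xfer\s+(\d+),\s+retry\s+(\d+)")
# A later successful retry: "wd2: soft error (corrected) xfer 338"
CORRECTED = re.compile(r"soft error \(corrected\)\s+xfer\s+(\d+)")


def classify_xfers(log_text):
    """Split xfer numbers into those with a 'corrected' message and those without.

    Assumption: an xfer number refers to the same transfer throughout
    the excerpt being examined.
    """
    failed = {int(m.group(1)) for m in FAILED.finditer(log_text)}
    corrected = {int(m.group(1)) for m in CORRECTED.finditer(log_text)}
    return {
        "recovered": sorted(failed & corrected),
        "unresolved": sorted(failed - corrected),
    }


# Log lines taken from the excerpt above, with the line wrapping undone.
SAMPLE = """\
[ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
[ 87465.637977] wd2: (uncorrectable data error)
[ 87475.561683] wd2: soft error (corrected) xfer 338
[ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
[ 87506.393194] wd2: (uncorrectable data error)
[ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1
"""

if __name__ == "__main__":
    print(classify_xfers(SAMPLE))
```

An xfer listed as "unresolved" only means no "corrected" message shows up in the text that was scanned; the retry may still have succeeded later in the log.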
>
> For the sake of completeness, here is the issue of S.M.A.R.T. - even if
> I find it difficult to interpret:
>
> ```
> saturn$ doas atactl wd2 smart status
> SMART supported, SMART enabled
> id value thresh crit collect reliability description raw
> 1 197 51 yes online positive Raw read error rate 38669
> 3 176 21 yes online positive Spin-up time 6158
> 4 100 0 no online positive Start/stop count 510
> 5 200 140 yes online positive Reallocated sector count 0
> ```
I was expecting this value to be greater than 0 if the drive was
failing. Is the drive bad, or is it the cabling?
Cheers,
Ian