Hello all,
I run a NetBSD-based NAS at home. It is currently running NetBSD 9.1. The system boots from a USB stick, which also holds the root file system. The storage consists of 4 x 4 TB magnetic hard disks, configured as ZFS RAIDZ2.
Earlier I noticed that the I/O performance of the system suddenly collapsed drastically. A look at the syslog gives a pretty clear indication of the reason:
```
[ 87240.313853] wd2: (uncorrectable data error)
[ 87240.313853] wd2d: error reading fsbn 5707914328 of 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
[ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
[ 87465.637977] wd2: (uncorrectable data error)
[ 87475.561683] wd2: soft error (corrected) xfer 338
[ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
[ 87506.393194] wd2: (uncorrectable data error)
[ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1
```
The whole syslog is full of these messages. What surprises me is that the syslog reports "uncorrectable" data errors, yet the data can still be read, albeit very slowly. My assumption was that the redundancy of RAIDZ2 was being used to compensate for the defects. To my surprise, ZFS does not seem to have noticed any of these defects:
```
saturn$ doas zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            dk0     ONLINE       0     0     0
            dk1     ONLINE       0     0     0
            dk2     ONLINE       0     0     0
            dk3     ONLINE       0     0     0

errors: No known data errors
```
Another indication that ZFS has not yet noticed the error: top shows no significant CPU load during I/O, in neither user nor system time. I would have expected at least some load if ZFS were reconstructing data from its redundancy.
So it looks as if the hardware error can still be corrected, as far as possible, at the level of the device driver, which makes me doubt that the message "uncorrectable data error" is accurate.
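To see how often a failing block is retried and how the "uncorrectable" messages compare with the "soft error (corrected)" ones, the kernel messages could be tallied per block. The following is only a rough sketch that I have not run on the NAS itself: the regular expressions are my own guess at the message format, based purely on the lines quoted above, and `summarize` is a made-up helper name.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt quoted earlier in this mail.
SAMPLE = """\
[ 87240.313853] wd2: (uncorrectable data error)
[ 87240.313853] wd2d: error reading fsbn 5707914328 of 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
[ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
[ 87465.637977] wd2: (uncorrectable data error)
[ 87475.561683] wd2: soft error (corrected) xfer 338
[ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
[ 87506.393194] wd2: (uncorrectable data error)
[ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1
"""

# Assumed message shapes, inferred only from the sample above.
READ_ERR = re.compile(r"(wd\d+)[a-z]: error reading fsbn (\d+) of \d+-\d+")
SOFT_ERR = re.compile(r"(wd\d+): soft error \(corrected\)")
HARD_ERR = re.compile(r"(wd\d+): \(uncorrectable data error\)")


def summarize(log_text):
    """Count failed read attempts per (disk, block) and error kinds per disk."""
    attempts = defaultdict(int)  # (disk, first failing block) -> failed reads
    soft = defaultdict(int)      # disk -> "soft error (corrected)" messages
    hard = defaultdict(int)      # disk -> "(uncorrectable data error)" messages
    for line in log_text.splitlines():
        m = READ_ERR.search(line)
        if m:
            attempts[(m.group(1), int(m.group(2)))] += 1
            continue
        m = SOFT_ERR.search(line)
        if m:
            soft[m.group(1)] += 1
            continue
        m = HARD_ERR.search(line)
        if m:
            hard[m.group(1)] += 1
    return attempts, soft, hard


if __name__ == "__main__":
    attempts, soft, hard = summarize(SAMPLE)
    for (disk, block), count in sorted(attempts.items()):
        print(f"{disk} block {block}: {count} failed read attempt(s)")
    for disk in sorted(set(soft) | set(hard)):
        print(f"{disk}: {hard[disk]} uncorrectable, {soft[disk]} corrected")
```

On the real system the input would be the full kernel message log instead of `SAMPLE`. A block that shows up with several attempts and is later followed by a "corrected" message would support the idea that the driver's retries, rather than ZFS, are recovering the data.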
Does anyone know what would have to happen for ZFS to notice the hardware defect?
Next, I will try to take the wd2 (dk2) component offline.
For the sake of completeness, here is the S.M.A.R.T. output, even though I find it difficult to interpret:
```
saturn$ doas atactl wd2 smart status
SMART supported, SMART enabled
 id value thresh crit collect reliability description                raw
  1   197     51 yes  online  positive    Raw read error rate        38669
  3   176     21 yes  online  positive    Spin-up time               6158
  4   100      0 no   online  positive    Start/stop count           510
  5   200    140 yes  online  positive    Reallocated sector count   0
  7   200      0 no   online  positive    Seek error rate            0
  9    64      0 no   online  positive    Power-on hours count       26740
 10   100      0 no   online  positive    Spin retry count           0
 11   100      0 no   online  positive    Calibration retry count    0
 12   100      0 no   online  positive    Device power cycle count   506
192   200      0 no   online  positive    Power-off retract count    99
193   200      0 no   online  positive    Load cycle count           2672
194   117      0 no   online  positive    Temperature                33
196   200      0 no   online  positive    Reallocated event count    0
197   200      0 no   online  positive    Current pending sector     18
198   100      0 no   offline positive    Offline uncorrectable      0
199   200      0 no   online  positive    Ultra DMA CRC error count  0
200   100      0 no   offline positive    Write error rate           0
```
Kind regards
Matthias