NetBSD-Users archive
Re: ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?
On Wed, 14 Jul 2021, Greg Troxel wrote:
What I do for each of my (physical) disks, spinning and SSD, once every
few months, is (x86-centric; use partition c on other ports)
dd if=/dev/rwd0d of=/dev/null bs=1m
and see if that throws any errors. If there is one, I try to read that
block a few times, and generally then will 1) take that as a sign to
replace the disk (or move it to an nth external backup) and 2) write
that sector, so that it gets reallocated. If the disk is part of raid1,
You can make the drive itself do that whole-disk scan and collect
the `offline' statistics while it is doing so, using the
smartmontools package:
root# smartctl -t long /dev/XXX
The command will show how long it'll take for that test to complete
(a few hours for TB-capacity drives). The test runs inside the drive,
so the command itself returns right away. After the test completes
(or to check on test progress) run:
root# smartctl --all /dev/XXX > /tmp/XXX.smart-log.txt
saturn$ doas atactl wd2 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description raw
1 197 51 yes online positive Raw read error rate 38669
3 176 21 yes online positive Spin-up time 6158
4 100 0 no online positive Start/stop count 510
5 200 140 yes online positive Reallocated sector count 0
7 200 0 no online positive Seek error rate 0
9 64 0 no online positive Power-on hours count 26740
10 100 0 no online positive Spin retry count 0
11 100 0 no online positive Calibration retry count 0
12 100 0 no online positive Device power cycle count 506
192 200 0 no online positive Power-off retract count 99
193 200 0 no online positive Load cycle count 2672
194 117 0 no online positive Temperature 33
196 200 0 no online positive Reallocated event count 0
197 200 0 no online positive Current pending sector 18
This is the big deal. The drive has decided that 18 sectors are not
ok. It will reallocate them when they are next written, but until then
it returns reads of them as uncorrectable rather than silently handing
bad data to the OS.
198 100 0 no offline positive Offline uncorrectable 0
199 200 0 no online positive Ultra DMA CRC error count 0
200 100 0 no offline positive Write error rate 0
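For anyone who wants to watch that pending-sector count from a script, here is a rough Python sketch that picks it out of saved `atactl wdN smart status` output. It assumes the column layout shown in the listing above (id, value, thresh, crit, collect, reliability, description, raw) and takes the first all-digit token after the description as the raw value. The names `parse_smart_rows` and `pending_sectors` are made up for this illustration and are not part of atactl or smartmontools.

```python
import re

# Assumed row layout, taken from the listing above:
#   id value thresh crit collect reliability description... raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(yes|no)\s+(online|offline)\s+(\S+)\s+(.*)$"
)


def parse_smart_rows(text):
    """Return one dict per attribute row; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, status line, or commentary between rows
        attr_id, value, thresh, crit, collect, _reliability, rest = m.groups()
        desc_words, raw = [], None
        for tok in rest.split():
            if tok.isdigit():
                raw = int(tok)  # first numeric token is treated as the raw value
                break
            desc_words.append(tok)
        rows.append({
            "id": int(attr_id),
            "value": int(value),
            "thresh": int(thresh),
            "crit": crit == "yes",
            "collect": collect,
            "description": " ".join(desc_words),
            "raw": raw,
        })
    return rows


def pending_sectors(rows, watch_ids=(197,)):
    """Rows from watch_ids whose raw count is non-zero (197 = pending sectors)."""
    return [r for r in rows if r["id"] in watch_ids and r["raw"]]


# A few rows copied from the wd2 listing above.
SAMPLE = """\
SMART supported, SMART enabled
id value thresh crit collect reliability description raw
5 200 140 yes online positive Reallocated sector count 0
197 200 0 no online positive Current pending sector 18
198 100 0 no offline positive Offline uncorrectable 0
"""

if __name__ == "__main__":
    for row in pending_sectors(parse_smart_rows(SAMPLE)):
        print(row["id"], row["description"], row["raw"])
```

Only attribute 197 is watched by default because that is the one called out here; pass other ids in `watch_ids` if you want to track them as well.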
mp@: What's surprising is how pristine the drive looks apart from that
`Current pending sector' count, and even that hasn't pushed the value
below its threshold (none of the current values have dropped that far).
Is it a new drive? If it is, then sector reallocation happening on it
is a worry. Are the cables also OK? Check them, too.
As a comparison, here's what my 15 year old drive looks like:
$ sudo atactl wd0 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description raw
1 119 6 yes online positive Raw read error rate 227910048
3 99 0 yes online positive Spin-up time 0
4 93 20 no online positive Start/stop count 7741
5 100 36 yes online positive Reallocated sector count 0
7 82 30 yes online positive Seek error rate 4464083330
9 74 0 no online positive Power-on hours count 83567178701327
10 100 97 yes online positive Spin retry count 0
12 93 20 no online positive Device power cycle count 7724
184 100 99 no online positive End-to-end error 0
187 100 0 no online positive Reported Uncorrectable Errors 0
188 100 0 no online positive Command Timeout 0
189 100 0 no online positive High Fly Writes 0
190 67 45 no online positive Airflow Temperature 33 Lifetime min/max 23/0
191 100 0 no online positive G-sense error rate 179
192 100 0 no online positive Power-off retract count 730
193 1 0 no online positive Load cycle count 1041873
194 33 0 no online positive Temperature 33 Lifetime min/max 0/19
196 77 30 yes online positive Reallocated event count 172189533884560
197 100 0 no online positive Current pending sector 0
198 100 0 no offline positive Offline uncorrectable 0
199 200 0 no online positive Ultra DMA CRC error count 0
240 77 0 no offline positive Head flying hours 172189533884560
241 100 0 no offline positive Total LBAs Written 1786969693
242 100 0 no offline positive Total LBAs Read 3326934803
254 100 0 no online positive Free Fall Sensor 0
None of the current value fields have dropped below their thresholds.
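That value-versus-threshold comparison is easy to automate. The Python sketch below is only an illustration: it assumes the first three columns of each row are id, value and thresh, as in the listings above, and it treats an attribute as failed when value <= thresh, which is the usual SMART convention as I understand it. The function name `failing_attributes` is hypothetical.

```python
def failing_attributes(text):
    """Return (id, value, thresh) for rows whose value is at or below thresh.

    Assumes `atactl wdN smart status` style rows: id value thresh crit ...
    Lines that do not start with three integers are ignored.
    """
    failing = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:3]):
            continue  # header, status line, or free text
        attr_id, value, thresh = (int(f) for f in fields[:3])
        if value <= thresh:  # assumption: "<=" counts as a failed attribute
            failing.append((attr_id, value, thresh))
    return failing


# A few rows copied from the wd0 listing above.
SAMPLE = """\
id value thresh crit collect reliability description raw
1 119 6 yes online positive Raw read error rate 227910048
10 100 97 yes online positive Spin retry count 0
193 1 0 no online positive Load cycle count 1041873
197 100 0 no online positive Current pending sector 0
"""

if __name__ == "__main__":
    bad = failing_attributes(SAMPLE)
    print("failing attributes:", bad if bad else "none")
```

Note that a clean result here says nothing about the raw counts: the wd2 drive above also passes this check while still reporting 18 pending sectors.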
FYI mp@: https://www.smartmontools.org/wiki/FAQ
-RVP