Re: NetBSD 9.1 upgrade and file system crash - reboot fails

To: Martin Husemann <martin%duskware.de@localhost>
Subject: Re: NetBSD 9.1 upgrade and file system crash - reboot fails
From: Riccardo Mottola <riccardo.mottola%libero.it@localhost>
Date: Mon, 02 Nov 2020 08:39:00 +0000

Hi Martin,

On 2020-10-30 14:51:28 +0000 Martin Husemann <martin%duskware.de@localhost> wrote:

On Fri, Oct 30, 2020 at 03:41:55PM +0100, Riccardo Mottola wrote:

A lot of errors.... and the system is not bootable anymore! I get:
NetBSD MBR boot....
Non-System disk or disk error


This is very early MBR boot sector failure, it should not be related
to the fsck issue - but maybe your disk starts to act up?

Could be... the boot part should not be affected by a kernel/filesystem error, right? (Except for something very bad like an out-of-partition access or such.)

The disk should be pretty new, but read below.

I would start checking fdisk output for the disk - is it still as
expected? Does it show a NetBSD partition with expected size?


Disk: /dev/wd0
NetBSD disklabel disk geometry:
cylinders: 155061, heads: 16, sectors/track: 63 (1008 sectors/cylinder)
total sectors: 156301488, bytes/sector: 512

BIOS disk geometry:
cylinders: 1022, heads: 240, sectors/track: 63 (15120 sectors/cylinder)
total sectors: 156301488

Partitions aligned to 15120 sector boundaries, offset 63

Partition table:
0: NetBSD (sysid 169)
    start 64, size 156301424 (76319 MB, Cyls 0/1/2-10337/95/63), Active
1: <UNUSED>
2: <UNUSED>
3: <UNUSED>
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x00000000)


disklabel:
4 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a: 151173728        64     4.2BSD      0     0     0  # (Cyl.      0*- 149973)
 b:   5127696 151173792       swap                     # (Cyl. 149974 - 155060)
 c: 156301424        64     unused      0     0        # (Cyl.      0*- 155060)
 d: 156301488         0     unused      0     0        # (Cyl.      0 - 155060)

The offset and size of c match the partition table. Is that fine enough?
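
The comparison can also be done arithmetically. Below is a minimal Python sketch that uses only the sector counts from the fdisk and disklabel output above; the names (`mbr`, `label`, `check_layout`) are made up for illustration and are not part of any NetBSD tool.

```python
# Sector figures copied by hand from the fdisk and disklabel output above.
TOTAL_SECTORS = 156301488
SECTOR_BYTES = 512

# MBR partition 0 (NetBSD, sysid 169)
mbr = {"offset": 64, "size": 156301424}

# disklabel partitions
label = {
    "a": {"offset": 64, "size": 151173728},        # 4.2BSD (root)
    "b": {"offset": 151173792, "size": 5127696},   # swap
    "c": {"offset": 64, "size": 156301424},        # NetBSD part of the disk
    "d": {"offset": 0, "size": 156301488},         # whole disk
}


def end(part):
    """First sector after the partition."""
    return part["offset"] + part["size"]


def check_layout():
    """Return a dict of named consistency checks and their results."""
    return {
        "c matches the MBR partition": label["c"] == mbr,
        "d covers the whole disk": end(label["d"]) == TOTAL_SECTORS,
        "a starts where c starts": label["a"]["offset"] == label["c"]["offset"],
        "b starts right after a": end(label["a"]) == label["b"]["offset"],
        "b ends where c ends": end(label["b"]) == end(label["c"]),
        "c ends at the end of the disk": end(label["c"]) == TOTAL_SECTORS,
    }


def size_mb(part):
    """Partition size in whole MB (2**20 bytes), as fdisk reports it."""
    return part["size"] * SECTOR_BYTES // 2**20


if __name__ == "__main__":
    for name, ok in check_layout().items():
        print(("ok   " if ok else "FAIL ") + name)
    print("MBR partition size:", size_mb(mbr), "MB")
```

With the numbers above every check passes, so at least the label and the partition table agree with each other. This says nothing about whether the boot code itself is intact.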

Then compare the disklabel output, does it match?

If that is ok, install bootloader again.

I installed it anyway and got the machine booting again, then did all the checks. All important data is backed up; the only inconvenience is the typical setup, reinstall, etc.

Also use atactl to check the smart status of the disk.


How reliable is that data?

I checked SMART status, it looks a little worrying:
SMART supported, SMART enabled

id value thresh crit collect reliability description                 raw
  1  58   34     yes online  positive    Raw read error rate         27218486
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6082
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             125626383
  9  95    0     no  online  positive    Power-on hours count        4752
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2790
192  99    0     no  online  positive    Power-off retract count     2791
193  18    0     no  online  positive    Load cycle count            165436
194  37    0     no  online  positive    Temperature                 37 Lifetime min/max 0/11
195  58    0     no  online  positive    Hardware ECC Recovered      27218486
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

13 reallocated sectors; if one of them is in the MBR, who knows? Also, the number of power cycles and power-on hours is high, but reasonable. The read and seek error rates look incredibly high. So I thought of writing this to a file, checking the next day and then today again, just to see what increases.
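
For comparing saved copies of this output, a small parser helps. The following is a minimal Python sketch, assuming the column layout seen in the rows above (id, value, thresh, crit, collect, reliability, description, raw); `parse_smart` and `ROW` are hypothetical names, and the regular expression is a guess at the layout, not something taken from atactl itself.

```python
import re

# One attribute row: six whitespace-separated fields, then a free-text
# description, then the raw value (digits), optionally followed by extra
# text such as "Lifetime min/max 0/11".
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"(.+?)\s+(\d+)(?:\s.*)?$"
)


def parse_smart(text):
    """Map attribute id -> dict of fields for every row that matches."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, status line, blank line
        ident, value, thresh, crit, collect, rel, desc, raw = m.groups()
        attrs[int(ident)] = {
            "value": int(value),
            "thresh": int(thresh),
            "crit": crit == "yes",
            "collect": collect,
            "reliability": rel,
            "description": desc,
            "raw": int(raw),
        }
    return attrs


# A few rows from the first day's output, used as sample input.
SAMPLE = """\
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1  58   34     yes online  positive    Raw read error rate         27218486
  5 100   36     yes online  positive    Reallocated sector count    13
194  37    0     no  online  positive    Temperature                 37 Lifetime min/max 0/11
"""

if __name__ == "__main__":
    for ident, a in sorted(parse_smart(SAMPLE).items()):
        print(ident, a["description"], a["raw"])
```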


The day after:
SMART supported, SMART enabled

id value thresh crit collect reliability description                 raw
  1  59   34     yes online  positive    Raw read error rate         232650323
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6088
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             126691967
  9  95    0     no  online  positive    Power-on hours count        4762
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2793
192  99    0     no  online  positive    Power-off retract count     2794
193  17    0     no  online  positive    Load cycle count            166041
194  29    0     no  online  positive    Temperature                 29 Lifetime min/max 0/11
195  59    0     no  online  positive    Hardware ECC Recovered      232650323
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

Some stuff makes sense, like 10 more hours and a few more start/stop counts. But, e.g., the number of hardware ECC errors recovered is about 10 times higher? The same for the raw read error rate, wow...

Then this is the data for the third day (each time I did a power-off reboot, so it is not continuous operation; I shut down the laptop at night):


SMART supported, SMART enabled

id value thresh crit collect reliability description                 raw
  1  60   34     yes online  positive    Raw read error rate         73875073
  3  96    0     yes online  positive    Spin-up time                0
  4  95   20     no  online  positive    Start/stop count            6088
  5 100   36     yes online  positive    Reallocated sector count    13
  7  81   30     yes online  positive    Seek error rate             127050561
  9  95    0     no  online  positive    Power-on hours count        4771
 10 100   34     yes online  positive    Spin retry count            0
 12  98   20     no  online  positive    Device power cycle count    2793
192  99    0     no  online  positive    Power-off retract count     2794
193  17    0     no  online  positive    Load cycle count            166675
194  28    0     no  online  positive    Temperature                 28 Lifetime min/max 0/11
195  60    0     no  online  positive    Hardware ECC Recovered      73875073
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
202 100    0     no  online  positive    Data address mark errors    0

The number of read errors skyrocketed!

The number of reallocated sectors remains the same, and this is the only... reassuring thing.
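
To see what actually changes from day to day, the raw values can be put side by side. This is a minimal Python sketch with the raw figures from the three listings above typed in by hand; `SNAPSHOTS` and `day_deltas` are illustrative names only.

```python
# Raw values for day 1, day 2 and day 3, copied from the listings above.
SNAPSHOTS = {
    "Raw read error rate":      [27218486, 232650323, 73875073],
    "Start/stop count":         [6082, 6088, 6088],
    "Reallocated sector count": [13, 13, 13],
    "Seek error rate":          [125626383, 126691967, 127050561],
    "Power-on hours count":     [4752, 4762, 4771],
    "Device power cycle count": [2790, 2793, 2793],
    "Load cycle count":         [165436, 166041, 166675],
    "Hardware ECC Recovered":   [27218486, 232650323, 73875073],
}


def day_deltas(snapshots):
    """For each attribute, the change between consecutive snapshots."""
    return {
        name: [later - earlier for earlier, later in zip(raws, raws[1:])]
        for name, raws in snapshots.items()
    }


if __name__ == "__main__":
    for name, deltas in day_deltas(SNAPSHOTS).items():
        print(f"{name:<26}", " ".join(f"{d:+d}" for d in deltas))
```

With these figures the reallocated sector count does not move, the power-on hours grow by 10 and 9, and the load cycle count grows by about 600 per day. The raw read error value (and the identical ECC recovered value) goes up between day 1 and day 2 but is lower again on day 3, so it does not seem to behave like a simple running total.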


Riccardo
