Re: kern/58775 (apei(4) spamming console)

To: riastradh%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost, Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
Subject: Re: kern/58775 (apei(4) spamming console)
From: "Taylor R Campbell via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Sat, 26 Oct 2024 22:35:01 +0000 (UTC)

The following reply was made to PR kern/58775; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
Cc: gnats-bugs%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/58775 (apei(4) spamming console)
Date: Sat, 26 Oct 2024 22:34:00 +0000

 > Date: Sun, 27 Oct 2024 00:13:15 +0200
 > From: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
 >
 > I guess I'll hook up the machine's ipmi console on Monday, and see what
 > that has to say.

 I would be curious to see any details you can find there!

 > > Can you revert the previous patch and try the attached patch instead,
 > > which applies a rate limit to the console output?
 >
 > Done, resulted in a much more reasonable message rate. Thanks!

 Great, can you share the new dmesg output?

 > In the general case, how would I map the "error source" on hardware?

 Not sure there's a good general way to do this -- these correspond to
 SourceId numbers in acpidump.out, and you can follow to the Related
 SourceId numbers, but I'm not sure you get much out of that.  E.g.,
 hardware source 514 is a generic hardware error source which maps to
 the related source:

 	Type={PCI Express Endpoint AER}
 	SourceId=257
 	Flags={FIRMWARE_FIRST,GLOBAL}
 	Enabled={ YES (ignored) }
 	Number of Record to pre-allocate=1
 	Max. Sections per Record=16
 	Device Control=0x7
 	Uncorrectable Error Mask Register=0x100000
 	Uncorrectable Error Severity Register=0x7ef6030
 	Correctable Error Mask Register=0x0
 	Advanced Capabilities Register=0x0

 Which doesn't really tell us much.

 However, the log messages should show the PCI device identified in the
 error record.  Something like this, in the new patch (now that I've
 fixed the buffer sizing):

 PCI 0000:81:00.000: hardware corrected error: 0x1<RECEIVER_ERROR> (mask=0x0)

 This means segment 0, bus 0x81=129, device 0x00, and function 0, which
 you can look up in dmesg:

 [   1.0650718] pci8 at ppb5 bus 129
 [   1.0650718] nvme0 at pci8 dev 0 function 0: Samsung Electronics (3rd vendor ID) PM9A1 M.2 NVMe SSD (rev. 0x00)

 or with pcictl(8):

 pcictl pci0 dump -b 0x81 -d 0 -f 0

 That's how I identified it as your Samsung NVMe card -- specifically,
 the first one, nvme0.  (That said, I don't know how to map that to
 your physical motherboard layout.)
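
 The decoding above can be sketched in a few lines of Python.  This is
 only an illustration, not part of the patch: the helper names
 (parse_pci_addr, pcictl_command) are made up, and the address layout
 SSSS:BB:DD.FFF with hex segment, bus, and device is an assumption
 based on the single sample line above.

```python
import re

# Assumed layout of the address in the log line: SSSS:BB:DD.FFF, with
# segment, bus, and device in hex.  A PCI function number is 0-7, so
# its base does not matter for valid values.
PCI_ADDR = re.compile(
    r"PCI ([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7]+)"
)


def parse_pci_addr(line):
    """Return (segment, bus, device, function) as ints, or None."""
    m = PCI_ADDR.search(line)
    if m is None:
        return None
    seg, bus, dev = (int(x, 16) for x in m.group(1, 2, 3))
    func = int(m.group(4), 8)
    return seg, bus, dev, func


def pcictl_command(bus, dev, func, pcibus="pci0"):
    """Build the pcictl(8) invocation for one bus/device/function."""
    return f"pcictl {pcibus} dump -b {bus:#x} -d {dev} -f {func}"


if __name__ == "__main__":
    line = "PCI 0000:81:00.000: hardware corrected error: 0x1<RECEIVER_ERROR> (mask=0x0)"
    seg, bus, dev, func = parse_pci_addr(line)
    print(f"segment {seg}, bus {bus}, device {dev}, function {func}")
    print(pcictl_command(bus, dev, func))
```

 The decimal bus number (129 here) is what to look for in the
 "pciN at ppbM bus 129" line in dmesg.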

 It is also shown in the DeviceID={...} lines, in somewhat obscure hex
 (https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#pci-express-error-section),
 which is how I decoded it in spite of the broken format string in the
 first draft of the patch.
