NetBSD-Bugs archive
Re: kern/58775 (apei(4) spamming console)
The following reply was made to PR kern/58775; it has been noted by GNATS.
From: Taylor R Campbell <campbell%mumble.net@localhost>
To: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
Cc: gnats-bugs%netbsd.org@localhost, gnats-admin%netbsd.org@localhost
Subject: Re: kern/58775 (apei(4) spamming console)
Date: Fri, 25 Oct 2024 18:36:32 +0000
> Date: Fri, 25 Oct 2024 17:13:11 +0200
> From: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
>
> On Fri, 25 Oct 2024 14:06:24 +0000, Taylor R Campbell wrote:
> >>
> >> apei0 message flood unchanged, unfortunately.
> >
> > Can you please share dmesg output?
>
> <ftp://ftp.causeuse.org/pub/NetBSD/kern-58775/kern-58775.dmesg_patched.gz>
Great, thanks.
So, you're getting several kinds of correctable PCI errors
(RECEIVER_ERROR, REPLAY_TIMER_TIMEOUT, BAD_DLLP) on the PCI devices at
bus 128 dev 3 function 1 (ppb5) and at bus 129 dev 0 function 0
(nvme0),[*] and either they're not going away when acknowledged or
they're coming back after the OS has acknowledged them.
This could be harmless but it could also be evidence that your
hardware is failing, so let's take a closer look just in case this
isn't simply a false alarm.
Can you share the output of the following commands?
pcictl pci0 list -N
pcictl pci0 list -n -N
pcictl pci0 dump -b 0x80 -d 3 -f 0
pcictl pci0 dump -b 0x80 -d 3 -f 1
pcictl pci0 dump -b 0x80 -d 3 -f 2
pcictl pci0 dump -b 0x81 -d 0 -f 0
nvmectl devlist
nvmectl identify nvme0
nvmectl logpage -p 1 nvme0
nvmectl logpage -p 2 nvme0
nvmectl logpage -p 3 nvme0
(I'm also tempted to suggest you try re-seating any PCI cards you
have, in particular the Samsung NVMe card, but for the moment I want
to take advantage of the hardware errors to test apei(4) driver
support for PCIe errors! This is the first machine I've had access to
which is exercising these paths in practice so it's time for SCIENCE
(if that's OK with you -- consensual science is the best science).)
[*] I had to decode the b/d/f from the DeviceID={...} lines, since my
`PCI %04x:%02x:%02x:%u' printf came out truncated: I accidentally
sized the buffer with sizeof("0000:00:00.000") instead of
sizeof("PCI 0000:00:00.000"). Will be fixed in the next version
of the patch!
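
To illustrate the footnote, here is a minimal sketch of how an undersized buffer truncates that format string. It is not the apei(4) code: the helper name format_pci_addr and the sample values (segment 0, bus 0x80, dev 3, function 1) are assumptions made up for the example; only the format string and the two sizeof expressions come from the message above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the actual patch): format a PCI
 * segment/bus/device/function into buf using the format string quoted
 * in the footnote.  Returns what snprintf returns, i.e. the length the
 * full string would have had, excluding the terminating NUL.
 */
static int
format_pci_addr(char *buf, size_t buflen, unsigned seg, unsigned bus,
    unsigned dev, unsigned fn)
{
	return snprintf(buf, buflen, "PCI %04x:%02x:%02x:%u",
	    seg, bus, dev, fn);
}

/*
 * Hypothetical helper: report whether the snprintf result n means the
 * output did not fit in a buffer of buflen bytes.
 */
static int
was_truncated(int n, size_t buflen)
{
	return n < 0 || (size_t)n >= buflen;
}
```

With a buffer declared as char buf[sizeof("0000:00:00.000")] (15 bytes), the "PCI " prefix uses up four of the available characters, so the tail of the address is cut off; with char buf[sizeof("PCI 0000:00:00.000")] (19 bytes) the whole string fits. Checking the snprintf return value against the buffer size, as was_truncated does, is one way to catch this kind of mistake early.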