NetBSD-Bugs archive


Re: kern/58775 (apei(4) spamming console)



The following reply was made to PR kern/58775; it has been noted by GNATS.

From: Taylor R Campbell <campbell%mumble.net@localhost>
To: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
Cc: gnats-bugs%netbsd.org@localhost, gnats-admin%netbsd.org@localhost
Subject: Re: kern/58775 (apei(4) spamming console)
Date: Fri, 25 Oct 2024 18:36:32 +0000

 > Date: Fri, 25 Oct 2024 17:13:11 +0200
 > From: Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
 > 
 > On Fri, 25 Oct 2024 14:06:24 +0000, Taylor R Campbell wrote:
 > >> 
 > >> apei0 message flood unchanged, unfortunately.
 > > 
 > > Can you please share dmesg output?
 > 
 > <ftp://ftp.causeuse.org/pub/NetBSD/kern-58775/kern-58775.dmesg_patched.gz>
 
 Great, thanks.
 
 So, you're getting a variety of different kinds of correctable PCI
 errors (RECEIVER_ERROR, REPLAY_TIMER_TIMEOUT,
 BAD_DLLP), on PCI devices at bus 128 dev 3 function 1 (ppb5) and at
 bus 129 dev 0 function 0 (nvme0),[*] and either they're not going away
 when acknowledged or they're coming back after the OS has acknowledged
 them.
 
 This could be harmless but it could also be evidence that your
 hardware is failing, so let's take a closer look just in case this
 isn't simply a false alarm.
 
 Can you share the output of the following commands?
 
 pcictl pci0 list -N
 pcictl pci0 list -n -N
 pcictl pci0 dump -b 0x80 -d 3 -f 0
 pcictl pci0 dump -b 0x80 -d 3 -f 1
 pcictl pci0 dump -b 0x80 -d 3 -f 2
 pcictl pci0 dump -b 0x81 -d 0 -f 0
 nvmectl devlist
 nvmectl identify nvme0
 nvmectl logpage -p 1 nvme0
 nvmectl logpage -p 2 nvme0
 nvmectl logpage -p 3 nvme0
 
 (I'm also tempted to suggest you try re-seating any PCI cards you
 have, in particular the Samsung NVMe card, but for the moment I want
 to take advantage of the hardware errors to test apei(4) driver
 support for PCIe errors!  This is the first machine I've had access to
 which is exercising these paths in practice so it's time for SCIENCE
 (if that's OK with you -- consensual science is the best science).)
 
 
 [*] I had to decode the b/d/f from the DeviceID={...} lines, since my
     `PCI %04x:%02x:%02x:%u' printf was broken when I accidentally
     sized the buffer with sizeof("0000:00:00.000") instead of
     sizeof("PCI 0000:00:00.000").  Will be fixed in the next version
     of the patch!
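
 To illustrate the kind of truncation described in the footnote, here
 is a minimal sketch, assuming the address is formatted with snprintf
 into a fixed-size buffer.  The helper name format_pci_addr and its
 parameters are hypothetical and are not taken from the apei(4) patch;
 only the format string and the two sizeof expressions come from the
 message above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the apei(4) source): format a PCI
 * segment/bus/device/function into buf, which holds len bytes.
 * Returns what snprintf returns: the number of characters the full
 * string needs, excluding the terminating NUL.  A return value
 * >= len means the output was truncated.
 */
int
format_pci_addr(char *buf, size_t len, unsigned seg, unsigned bus,
    unsigned dev, unsigned func)
{
	/* Format string as quoted in the footnote. */
	return snprintf(buf, len, "PCI %04x:%02x:%02x:%u",
	    seg, bus, dev, func);
}

/*
 * Buffer sizes under discussion:
 *
 *   sizeof("0000:00:00.000")     == 15  (no room for the "PCI " prefix)
 *   sizeof("PCI 0000:00:00.000") == 19  (room for prefix and 3-digit function)
 */
```

 With a buffer declared as char small[sizeof("0000:00:00.000")], a
 call for bus 0x81, device 0, function 0 returns 16 but stores only
 "PCI 0000:81:00", so the function number is lost.  With
 char right[sizeof("PCI 0000:00:00.000")] the same call stores the
 whole string.  Comparing the return value against the buffer size is
 one way to catch this class of mistake.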
 

