Subject: Re: SMP stability issues
To: Chris Rendle-Short <jim@tty1.rr.nu>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: tech-smp
Date: 11/12/2006 11:45:13
--k+w/mQv8wyuph6w0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Sun, Nov 12, 2006 at 01:23:34PM +1100, Chris Rendle-Short wrote:
> Well, I just tried running GENERIC.MPACPI like some of the others suggested,
> however it is still locking up. Here is the dmesg from GENERIC.MPACPI
> (although it looks like I might need to check my ACPI configuration in the
> BIOS).
It looks like it's using ACPI.
> I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled
> as you suggested. Is it likely to matter whether or not ACPI is enabled in
> the test kernel?
No, these checks are independent of ACPI vs MPBIOS.
> pchb0 at pci0 dev 0 function 0
> pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
OK, this is the same motherboard as I have here (I have several of these). I
also have issues with them; I guess the debug options
will show you that the CPU is occasionally missing IPIs.
If so, the attached patch should help (my boxes are rock solid with this
patch). Note that it's only active if you have
options DIAGNOSTIC
in your kernel config.
Actually I suspect this is a bug in the chipset; I have Intel-based dual-PIII
motherboards which don't have this issue, nor do P4 SMP systems.
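
The retry logic in the attached patch can be sketched in userland C roughly
as follows. This is only an illustration under stated assumptions, not kernel
code: struct fake_cpu, fake_send_ipi() and rendezvous() are made-up names, the
"remote CPU" is simulated by a counter of lost notifications, and SPIN_LIMIT
is far smaller than the 10000000 used in the patch. The sketch also resets
its spin counter on every attempt, which is a choice made here for clarity.

```c
#include <assert.h>
#include <stddef.h>

#define SPIN_LIMIT  1000  /* illustrative; the patch spins 10000000 times */
#define MAX_RETRIES 5     /* same retry bound as the patch */

/* Hypothetical stand-in for a remote CPU that sometimes loses an IPI. */
struct fake_cpu {
	int drops_left;        /* how many sends are silently lost */
	volatile int pending;  /* nonzero until the request is acknowledged */
	int sends;             /* total notifications sent, for inspection */
};

/* Simulated IPI: either lost, or delivered and acknowledged at once. */
static void
fake_send_ipi(struct fake_cpu *c)
{
	c->sends++;
	if (c->drops_left > 0) {
		c->drops_left--;   /* notification lost */
		return;
	}
	c->pending = 0;        /* delivered: remote side acknowledges */
}

/*
 * Send, spin for a bounded time, and resend if nothing came back.
 * Returns the number of retries that were needed, or -1 at the point
 * where the kernel code would call panic().
 */
static int
rendezvous(struct fake_cpu *c)
{
	int retries = 0;

	assert(c != NULL);
	c->pending = 1;
	for (;;) {
		int spins = 0;     /* reset per attempt (assumption of this sketch) */

		fake_send_ipi(c);
		while (c->pending != 0) {
			if (++spins > SPIN_LIMIT)
				break;     /* timed out waiting for the acknowledgement */
		}
		if (c->pending == 0)
			return retries;
		if (retries++ >= MAX_RETRIES)
			return -1;     /* give up after MAX_RETRIES resends */
	}
}
```

With a fake_cpu that loses its first two notifications, rendezvous() returns 2
after three sends; one that loses six in a row makes it give up and return -1.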
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--
--k+w/mQv8wyuph6w0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="diff.via"
Index: i386/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
retrieving revision 1.181.2.2
diff -u -r1.181.2.2 pmap.c
--- i386/pmap.c 26 Sep 2005 20:24:52 -0000 1.181.2.2
+++ i386/pmap.c 12 Nov 2006 10:42:15 -0000
@@ -3652,6 +3652,7 @@
int s;
#ifdef DIAGNOSTIC
int count = 0;
+ int ipi_retry = 0;
#endif
#endif
@@ -3672,6 +3673,9 @@
/*
* Send the TLB IPI to other CPUs pending shootdowns.
*/
+#ifdef DIAGNOSTIC
+ipi_again:
+#endif
for (CPU_INFO_FOREACH(cii, ci)) {
if (ci == self)
continue;
@@ -3683,9 +3687,20 @@
while (self->ci_tlb_ipi_mask != 0) {
#ifdef DIAGNOSTIC
- if (count++ > 10000000)
+ if (count++ > 10000000) {
+ for (CPU_INFO_FOREACH(cii, ci)) {
+ if (ci == self)
+ continue;
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+ ci->ci_ilevel, ci->ci_ipending,
+ ci->ci_idepth, ci->ci_ipis);
+ }
+ if (ipi_retry++ < 5)
+ goto ipi_again;
panic("TLB IPI rendezvous failed (mask %x)",
self->ci_tlb_ipi_mask);
+ }
#endif
x86_pause();
}
Index: isa/npx.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
retrieving revision 1.107.4.1
diff -u -r1.107.4.1 npx.c
--- isa/npx.c 12 May 2006 15:41:46 -0000 1.107.4.1
+++ isa/npx.c 12 Nov 2006 10:42:16 -0000
@@ -752,6 +752,8 @@
} else {
#ifdef DIAGNOSTIC
int spincount;
+ int ipi_retry = 0;
+ipi_again:
#endif
IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
@@ -770,6 +772,16 @@
#ifdef DIAGNOSTIC
spincount++;
if (spincount > 10000000) {
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+ ci->ci_ilevel, ci->ci_ipending,
+ ci->ci_idepth, ci->ci_ipis);
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
+ oci->ci_ilevel, oci->ci_ipending,
+ oci->ci_idepth, oci->ci_ipis);
+ if (ipi_retry++ < 5)
+ goto ipi_again;
panic("fp_save ipi didn't");
}
#endif
--k+w/mQv8wyuph6w0--