Subject: Re: kern/32757: TLB IPI rendezvous fails sometimes
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 02/06/2006 22:10:03
The following reply was made to PR kern/32757; it has been noted by GNATS.
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
netbsd-bugs@NetBSD.org
Subject: Re: kern/32757: TLB IPI rendezvous fails sometimes
Date: Mon, 6 Feb 2006 23:06:26 +0100
--SUOF0GtieIMvvwua
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Mon, Feb 06, 2006 at 09:50:01AM +0000, seebs wrote:
> Machine: i386
> >Description:
> On at least some motherboards, NetBSD 2.1 occasionally fails with TLB
> IPI rendezvous failed. The patch (from pmap.c 1.184) is verified
> present.
> >How-To-Repeat:
> Run under load.
>
> Someone else on the NetBSD lists reports the same behavior with a
> Pentium 3 system, suggesting that this isn't just a specific
I'm the one who reported the problem. The hardware is a PIII 1 GHz on an
MSI 694D-Pro 2 motherboard:
mainbus0 (root)
mainbus0: Intel MP Specification (Version 1.4) (OEM00000 PROD00000000)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Pentium III (686-class), 1002.37 MHz, id 0x68a
cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu0: features 387fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu0: L2 cache 256 KB 32B/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: serial number 0000-068A-0001-DDD6-4ED7-4704
cpu0: calibrating local timer
cpu0: apic clock running at 133 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: Intel Pentium III (686-class), 1002.28 MHz, id 0x68a
cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu1: features 387fbff<FXSR,SSE>
cpu1: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu1: L2 cache 256 KB 32B/line 8-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu1: serial number 0000-068A-0003-ADAB-C15A-1E54
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
pcib0 at pci0 dev 7 function 0
pcib0: VIA Technologies VT82C686A PCI-ISA Bridge (rev. 0x40)
viaide0 at pci0 dev 7 function 1
viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller
I still see it with NetBSD 3.0, for both TLB IPIs and FPU IPIs.
I'm running with the attached patch, and all my systems are stable with
it. I have several systems based on the same hardware running SMP with
different workloads; all of them show the problem, anywhere from once a
day to once every several weeks, depending on the workload.
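
For readers who want the idea without the kernel context, here is a
rough user-space sketch of the control flow the patch adds: spin on the
pending mask, and on timeout resend the IPI up to five times before
giving up. Everything in it is a hypothetical stand-in (ipi_mask,
send_ipi(), the "lost IPI" counter, the small SPIN_LIMIT); only the
retry bound of 5 and the shape of the loop come from the diff. As far as
the hunks show, the spin counter is not reset before a retry, so the
sketch does not reset it either, which means the resends after the first
timeout follow each other almost immediately.

```c
#define SPIN_LIMIT 1000	/* the patch spins 10000000 times */
#define MAX_RETRY 5	/* same retry bound as the patch */

/* Hypothetical simulation state, not kernel structures. */
static unsigned int ipi_mask;	/* stands in for ci_tlb_ipi_mask */
static int ipis_sent;		/* how many times the IPI was sent */
static int ipis_lost;		/* how many initial sends are dropped */

/* Pretend to send the IPI; the first ipis_lost sends never arrive. */
static void
send_ipi(void)
{
	ipis_sent++;
	if (ipis_sent > ipis_lost)
		ipi_mask = 0;	/* remote CPU handled it, bit cleared */
}

/*
 * Returns the number of resends that were needed, or -1 at the point
 * where the kernel code would panic.
 */
static int
rendezvous(int lost)
{
	int count = 0;
	int ipi_retry = 0;

	ipi_mask = 0x2;		/* CPU 1 has a shootdown pending */
	ipis_sent = 0;
	ipis_lost = lost;
ipi_again:
	send_ipi();
	while (ipi_mask != 0) {
		/* count is deliberately not reset on retry (see above) */
		if (count++ > SPIN_LIMIT) {
			if (ipi_retry++ < MAX_RETRY)
				goto ipi_again;
			return -1;	/* panic("TLB IPI rendezvous failed") */
		}
	}
	return ipi_retry;
}
```

With this model, rendezvous(0) completes with no resend, rendezvous(2)
completes after two resends, and rendezvous(10) gives up after the
initial send plus five resends.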
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--
--SUOF0GtieIMvvwua
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff
Index: i386/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
retrieving revision 1.181.2.2
diff -u -r1.181.2.2 pmap.c
--- i386/pmap.c 26 Sep 2005 20:24:52 -0000 1.181.2.2
+++ i386/pmap.c 6 Feb 2006 19:37:12 -0000
@@ -3652,6 +3652,7 @@
int s;
#ifdef DIAGNOSTIC
int count = 0;
+ int ipi_retry = 0;
#endif
#endif
@@ -3672,6 +3673,9 @@
/*
* Send the TLB IPI to other CPUs pending shootdowns.
*/
+#ifdef DIAGNOSTIC
+ipi_again:
+#endif
for (CPU_INFO_FOREACH(cii, ci)) {
if (ci == self)
continue;
@@ -3683,9 +3687,20 @@
while (self->ci_tlb_ipi_mask != 0) {
#ifdef DIAGNOSTIC
- if (count++ > 10000000)
+ if (count++ > 10000000) {
+ for (CPU_INFO_FOREACH(cii, ci)) {
+ if (ci == self)
+ continue;
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+ ci->ci_ilevel, ci->ci_ipending,
+ ci->ci_idepth, ci->ci_ipis);
+ }
+ if (ipi_retry++ < 5)
+ goto ipi_again;
panic("TLB IPI rendezvous failed (mask %x)",
self->ci_tlb_ipi_mask);
+ }
#endif
x86_pause();
}
Index: isa/npx.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
retrieving revision 1.107
diff -u -r1.107 npx.c
--- isa/npx.c 3 Feb 2005 21:08:58 -0000 1.107
+++ isa/npx.c 6 Feb 2006 19:37:12 -0000
@@ -732,6 +732,8 @@
} else {
#ifdef DIAGNOSTIC
int spincount;
+ int ipi_retry = 0;
+ipi_again:
#endif
IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
@@ -750,6 +752,16 @@
#ifdef DIAGNOSTIC
spincount++;
if (spincount > 10000000) {
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+ ci->ci_ilevel, ci->ci_ipending,
+ ci->ci_idepth, ci->ci_ipis);
+ printf("CPU %ld interrupt level 0x%x pending "
+ "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
+ oci->ci_ilevel, oci->ci_ipending,
+ oci->ci_idepth, oci->ci_ipis);
+ if (ipi_retry++ < 5)
+ goto ipi_again;
panic("fp_save ipi didn't");
}
#endif
--SUOF0GtieIMvvwua--