Port-i386 archive
Re: HT bug in some Intel CPUs ?
Manuel Bouyer writes:
> Hi,
> after fighting with an upgrade from NetBSD-3 to NetBSD-5/i386 of two
> identical servers, I came to the conclusion that hyperthreading is
> broken on this CPU, causing corrupted registers or memory reads
> (I couldn't determine which).
> The CPU is:
> cpu0: Intel (686-class), 3000.22 MHz, id 0xf4a
> cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
> cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
> cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
> cpu0: features2 641d<SSE3,MONITOR,DS-CPL,CID,xTPR>
> cpu0: features3 20100000<EM64T>
> cpu0: "Intel(R) Xeon(TM) CPU 3.00GHz"
> cpu0: I-cache 12K uOp cache 8-way
> cpu0: L2 cache 2 MB 64B/line 8-way
> cpu0: ITLB 4K/4M: 64 entries
> cpu0: DTLB 4K/4M: 64 entries
Interesting... the CPUs in the box I'm having grief upgrading from
NetBSD-3 to NetBSD-5/i386 look like this:
cpu0: Intel (686-class), 3000.35 MHz, id 0xf41
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 641d<SSE3,MONITOR,DS-CPL,CID,xTPR>
cpu0: features3 20100000<EM64T>
cpu0: "Intel(R) Xeon(TM) CPU 3.00GHz"
cpu0: I-cache 12K uOp cache 8-way
cpu0: L2 cache 1 MB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: using thermal monitor 1
cpu0: calibrating local timer
cpu0: apic clock running at 200 MHz
cpu0: 32 page colors
They don't show up as hyperthreaded in 3.0, but do in 5.0.1.
> I'll summarize my debug session: from the symptoms I came to the
> conclusion that ci_ilevel was maybe not restored properly, or was corrupted.
> I added some checks to splraiseipl() and splx(), including in splx():
>     if ((int)x < 0 || (int)x >= NIPL) { \
>         printf("splx(%d)\n", (int)x); \
>         panic("splx()"); \
>     } \
>
> This fires quite quickly after some activity (within minutes). x was -1
> in the instance where I printed its value (in previous attempts this
> was just a KASSERT).
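For anyone wanting to try the same kind of range check outside the kernel, here is a minimal user-space sketch. It is not the actual NetBSD code: the NIPL value is a placeholder (the real one comes from the kernel's interrupt-level headers), and ipl_in_range() and checked_splx() are hypothetical names used only for illustration. Instead of panicking, the stand-in prints the bad value and returns an error.

```c
#include <stdio.h>

/* Placeholder value; the real NIPL is defined by the kernel headers. */
#define NIPL 9

/* Hypothetical helper: 1 if x is a plausible interrupt priority level. */
int
ipl_in_range(int x)
{
	return x >= 0 && x < NIPL;
}

/*
 * Hypothetical stand-in for the instrumented splx(): report an
 * out-of-range level and return -1 where the kernel check would panic.
 */
int
checked_splx(int x)
{
	if (!ipl_in_range(x)) {
		printf("splx(%d)\n", x);
		return -1;	/* kernel version: panic("splx()") */
	}
	return 0;
}
```

With this sketch, checked_splx(-1) takes the error path (the value seen in the report above), while any level from 0 to NIPL - 1 passes.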
> splx() was always called from mutex_vector_exit() via
> MUTEX_SPIN_SPLRESTORE().
> Looking at the lock value from ddb, mtxs_ipl did have the right value.
> The other CPU was always in the process of acquiring a lock.
> To me it looks like a hardware bug in the bus-locked operations which
> cause adjacent values to appear corrupted to the other CPU, maybe for
> a short time. Another possibility is register corruption between the 2
> threads.
>
> Both servers are stable with a kernel using only one CPU (but HT still enabled
> in BIOS).
>
> Did someone else notice something similar, or have information about
> such a bug?
I'd like to think that whatever issue you're seeing is the same one I
have... Would disabling hyperthreading help at all to provide a
datapoint? (After an uptime of 330+ days, having 3 hangs in a week
isn't giving me warm, fuzzy feelings :-/ The machine in question
is an IBM x336 that has been rock-stable under 3.0)
Later...
Greg Oster