Subject: Re: SS20 SMP panic
To: None <port-sparc@NetBSD.org>
From: Tillman Hodgson <tillman@seekingfire.com>
List: port-sparc
Date: 01/17/2005 07:48:38
On Sun, Jan 16, 2005 at 12:02:45AM -0600, Tillman Hodgson wrote:
> The last thing logged in /var/log/messages:
> 
> Jan 15 23:17:06 surya /netbsd: Async registers (mid 9): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> Jan 15 23:17:06 surya /netbsd: Async registers (mid 8): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
> Jan 15 23:17:06 surya /netbsd: nmi_hard: SMP botch.cpu0: NMI: system interrupts: 10090000<VME=0,SBUS=0,E,T,M>

The machine died again at 3:15 last night:

Jan 17 03:15:01 surya /netbsd: Async registers (mid 9): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
Jan 17 03:15:01 surya /netbsd: Async registers (mid 8): afsr=3c00<SE,UC,TO,BE,AFA=0>; afva=0x00
Jan 17 03:15:01 surya /netbsd: cpu0: NMI: system interrupts: 10080000<VME=0,SBUS=0,T,M>
Jan 17 03:15:01 surya /netbsd: memory error:

Oddly, it was still responding to pings.
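
For comparing the two NMI lines, here is a rough Python sketch that
pulls the hex value and the decoded names out of the "value<names>"
form the kernel prints, then shows what differs between the two
crashes. The names FIRST, SECOND, PATTERN and parse() are made up for
illustration, and it only compares what was logged; it assumes nothing
about what the individual bits mean.

```python
import re

# NMI lines copied from the two crashes quoted in this message.
FIRST = "nmi_hard: SMP botch.cpu0: NMI: system interrupts: 10090000<VME=0,SBUS=0,E,T,M>"
SECOND = "cpu0: NMI: system interrupts: 10080000<VME=0,SBUS=0,T,M>"

# Matches the "hexvalue<comma,separated,names>" form seen in the log.
PATTERN = re.compile(r"([0-9a-f]+)<([^>]*)>")


def parse(line):
    """Return (register value, set of printed names) for one log line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("no register dump found in: " + line)
    return int(m.group(1), 16), set(m.group(2).split(","))


v1, names1 = parse(FIRST)
v2, names2 = parse(SECOND)

# Bits that differ between the two crashes, and the names that go with them.
print(hex(v1 ^ v2))             # 0x10000
print(sorted(names1 - names2))  # ['E']
```

So the only difference in the interrupt word is one bit, printed as
"E", which was set in the first crash and not in the second. The afsr
lines are identical both times.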

The 3:15 time is suspicious because that's when the daily scripts run,
and I have an /etc/daily.local that backs up the disk as a gzip'ed tar
archive onto a second disk. Gzip uses a lot of CPU time, so both
crashes happened while the CPUs were very busy.

I'm not sure what to make of the error messages. A memory error would
seem to be a different thing than a CPU problem, unless perhaps it's a
cache problem. In any case, if anyone can read the error messages well
enough to tell whether I ought to be pulling RAM sticks until the
problem goes away or swapping out the CPUs, I'd be most appreciative :-)

-T


-- 
Zen is the unsymbolization of the world.
	R.H. Blyth