Subject: Re: Isolating NMI/memory problem with old SPARCserver 20
To: None <port-sparc@NetBSD.org>
From: Greg Earle <earle@isolar.DynDNS.ORG>
List: port-sparc
Date: 08/06/2005 14:26:28
On Aug 6, 2005, at 1:25 PM, Havard Eidnes wrote:
>> And, if so, how do I map it to the bad DIMM module?
>
> Upgrade to a newer version of NetBSD? ;-)
Its next "upgrade" will be to retirement, as soon as I
move its functions over to my dual-450 Ultra 60. So that
isn't really an option ...
>> (Update: I just saw an old post to port-sparc from May 9th from
>> Malte Dehling; he reported a similar error, but his log also
>> shows a "module location: " identifier? Mine doesn't - is this
>> a new reporting feature in NetBSD 2.0 or something?)
>
> Yes. The code was added in revision 1.8 of memecc.c on 22 Mar 2004.
>
> Index: memecc.c
> ===================================================================
> RCS file: /u/nb/src/sys/arch/sparc/sparc/memecc.c,v
> retrieving revision 1.7
> retrieving revision 1.8
> diff -u -r1.7 -r1.8
> --- memecc.c 15 Jul 2003 00:05:06 -0000 1.7
> +++ memecc.c 22 Mar 2004 12:37:43 -0000 1.8
> @@ -142,6 +142,8 @@
>          printf("\tMBus transaction: %s\n",
>                  bitmask_snprintf(efar0, ECC_AFR_BITS, bits,
>                                   sizeof(bits)));
>          printf("\taddress: 0x%x%x\n", efar0 & ECC_AFR_PAH, efar1);
> +        printf("\tmodule location: %s\n",
> +                prom_pa_location(efar1, efar0 & ECC_AFR_PAH));
>
>          /* Unlock registers and clear interrupt */
>          bus_space_write_4(memecc_sc->sc_bt, bh, ECC_FSR_REG, efsr);
>
> However, that came with another set of changes, the
> source-changes message was:
>
> Module Name: src
> Committed By: pk
> Date: Mon Mar 22 12:37:43 UTC 2004
>
> Modified Files:
> src/sys/arch/sparc/include: promlib.h
> src/sys/arch/sparc/sparc: memecc.c memreg.c promlib.c
>
> Log Message:
> Leverage the PROM's ability to identify the on-board location of a
> physical memory address.
>
> To generate a diff of this commit:
> cvs rdiff -r1.18 -r1.19 src/sys/arch/sparc/include/promlib.h
> cvs rdiff -r1.7 -r1.8 src/sys/arch/sparc/sparc/memecc.c
> cvs rdiff -r1.37 -r1.38 src/sys/arch/sparc/sparc/memreg.c
> cvs rdiff -r1.31 -r1.32 src/sys/arch/sparc/sparc/promlib.c
>
> For a quick try, you could perhaps try to add those changes to
> your local source tree and run that kernel? (It's not a given
> that this doesn't depend on some other change, but it's worth a
> try.) That is, if your machine stays up long enough for you to
> patch and compile a new kernel...
It stays up; these aren't fatal. And they're only occasional.
More of a problem is the fact that my versions of these 4 files
are ancient compared to the ones you've mentioned:
==> src/sys/arch/sparc/include/promlib.h <==
/* $NetBSD: promlib.h,v 1.4 2001/09/26 20:53:07 eeh Exp $ */
==> src/sys/arch/sparc/sparc/memecc.c <==
/* $NetBSD: memecc.c,v 1.3 2002/03/11 16:27:04 pk Exp $ */
==> src/sys/arch/sparc/sparc/memreg.c <==
/* $NetBSD: memreg.c,v 1.32 2002/03/11 16:27:04 pk Exp $ */
==> src/sys/arch/sparc/sparc/promlib.c <==
/* $NetBSD: promlib.c,v 1.13 2001/12/07 11:00:39 hannken Exp $ */
So I'm a bit afraid that these 4 diffs won't just drop right in ...
(I suppose I can try it and see, though)
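Side note on reading the "address:" line that the quoted memecc.c code
prints, for anyone lining it up against the installed banks: the format
"0x%x%x" writes the high bits and the low word back to back without
zero-padding the low word, so a high part of 0x1 and a low word of 0x1234
comes out as 0x11234. The following is only a sketch of how the two
registers could be combined into one unambiguous value. The helper names
ecc_fault_pa and ecc_format_pa are made up for illustration, and the mask
value is an assumption that should be checked against memeccreg.h in your
own tree.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Assumed value: mask selecting physical address bits 35..32 in the
 * first fault address register. Verify against your memeccreg.h.
 */
#define ECC_AFR_PAH 0x0000000fu

/* Hypothetical helper: combine both fault address registers. */
static uint64_t
ecc_fault_pa(uint32_t efar0, uint32_t efar1)
{
	/* High 4 bits come from efar0, low 32 bits from efar1. */
	return ((uint64_t)(efar0 & ECC_AFR_PAH) << 32) | efar1;
}

/*
 * Hypothetical helper: format the address as "0x" plus nine hex
 * digits (36 bits), zero-padded so the low word cannot be misread.
 * Returns what snprintf returns.
 */
static int
ecc_format_pa(char *buf, size_t len, uint32_t efar0, uint32_t efar1)
{
	assert(buf != NULL);
	return snprintf(buf, len, "0x%09" PRIx64,
	    ecc_fault_pa(efar0, efar1));
}
```

With these helpers, ecc_fault_pa(0x1, 0x1234) yields 0x100001234. This
only helps with reading the logged address; it does not tell you which
bank the address falls in, which is what the PROM lookup in the 1.8
change provides.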
>> I've got 3 64 MB DIMMs (in banks 0, 1 and 5) for a total of 192 MB,
>> so I could live without one of 'em temporarily ... what's weird is
>> that I did a "test-memory" from the boot PROM (with "selftest-#megs?"
>> set to all 192 MB) as well as booting in diag mode and having it test
>> memory there as well, and it didn't hiccup on that address ...
>
> It's not certain that the memory test in the prom is all that
> thorough. It could also be heat-related, as someone else
> commented.
The odd thing about that (thanks for the suggestions, btw) is
that the machine is in a small room with a window unit A/C,
so theoretically it should never be getting all that hot.
Thanks,
- Greg