Subject: port-sun3/4691: sun3 ECC error reporting works, but error is not cleared and system loops forever
To: None <gnats-bugs@gnats.netbsd.org>
From: None <woods@sometimes.weird.com>
List: netbsd-bugs
Date: 12/15/1997 12:47:11
>Number: 4691
>Category: port-sun3
>Synopsis: sun3 ECC error reporting works, but error is not cleared and system loops forever
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: gnats-admin (GNATS administrator)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Dec 15 09:50:02 1997
>Last-Modified:
>Originator: Greg A. Woods
>Organization:
Greg A. Woods
+1 416 443-1734 VE3TCP <gwoods@acm.org> <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
>Release: NetBSD-current 1997/12/14
>Environment:
System: NetBSD sometimes 1.3_ALPHA NetBSD 1.3_ALPHA (MOUSETRAP) #3: Wed Dec 3 12:38:27 EST 1997 woods@sometimes:/var/usr.src/sys/arch/sun3/compile/MOUSETRAP sun3
>Description:
ECC memory, and memory error detection in general, are critically
important to me. I decided to enable ECC correctable error interrupts
on my test machine, since I had already observed the CE lamp lit on
one of the boards in the system and I wanted to make sure that the
system logs noted this event as well.
I did so, and today I found the machine stuck in a tight loop, endlessly
reporting the same error, with no response to the keyboard or any other I/O:
Memory error on CPU cycle!
ctx=4, vaddr=0xe5f8007, paddr=0x1254000
csr=d1<IPEND,IENA,CE_ENA,CE>
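For reference, the csr value above can be decoded by hand. The following
is only a sketch: the bit values are assumed from the printed line itself
(0x80 + 0x40 + 0x10 + 0x01 = 0xd1), ME_CSR_IENA and ME_ECC_CE_ENA are the
names used in the patch below, and ME_CSR_IPEND, ME_ECC_CE and csr_decode()
are names made up for this illustration. Check them against the real
header before relying on them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Bit values ASSUMED from the printed line csr=d1<IPEND,IENA,CE_ENA,CE>.
 * They are not copied from the kernel header.
 */
#define ME_CSR_IPEND	0x80	/* interrupt pending (name assumed) */
#define ME_CSR_IENA	0x40	/* error interrupt enable */
#define ME_ECC_CE_ENA	0x10	/* correctable error reporting enable */
#define ME_ECC_CE	0x01	/* correctable error seen (name assumed) */

static const struct {
	unsigned bit;
	const char *name;
} csr_bits[] = {
	{ ME_CSR_IPEND,  "IPEND" },
	{ ME_CSR_IENA,   "IENA" },
	{ ME_ECC_CE_ENA, "CE_ENA" },
	{ ME_ECC_CE,     "CE" },
};

/* Format csr as "hex<NAME,NAME,...>" into buf and return buf. */
static char *
csr_decode(unsigned csr, char *buf, size_t len)
{
	const char *sep = "";
	size_t i, used;

	assert(len > 0);
	snprintf(buf, len, "%x<", csr);
	for (i = 0; i < sizeof(csr_bits) / sizeof(csr_bits[0]); i++) {
		if (csr & csr_bits[i].bit) {
			used = strlen(buf);
			snprintf(buf + used, len - used, "%s%s",
			    sep, csr_bits[i].name);
			sep = ",";
		}
	}
	used = strlen(buf);
	snprintf(buf + used, len - used, ">");
	return (buf);
}
```

Calling csr_decode(0xd1, buf, sizeof(buf)) with a 64-byte buffer reproduces
the flag list in the message above, i.e. the interrupt is still pending and
the CE bit is still set at the time the handler prints the register.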
Unfortunately I have the PROM set to cause a system reset on watchdog
reset. (With the other setting, does a watchdog reset drop into DDB,
or straight to the PROM? If the latter, is there a way to get to DDB
from the PROM?)
>How-To-Repeat:
Find a memory board that has occasional correctable errors and install it.
Apply the following patch to /usr/src/sys/arch/sun3/dev/memerr.c and
build a new kernel:
11:53 [1233] # diff memerr.c-1.8 memerr.c
165c165
< mer->me_csr = ME_CSR_IENA; /* | ME_ECC_CE_ENA */
---
> mer->me_csr = ME_CSR_IENA | ME_ECC_CE_ENA;
Boot the new kernel and wait for the memory error to occur.
Observe that the system is locked in an interrupt loop reporting the
correctable error and that the CE LED is lit on the board.
Hit the watchdog reset button to break the loop and either drop to PROM or
reset the system.
>Fix:
Observe that the code in memerr.c that claims it will reset the error
doesn't do so:
recover:
	/* Clear the error by writing the address register. */
	me->me_vaddr = 0;
	return (1);
I would guess that the ECC boards need to be told directly that the error
has been handled and the interrupt should be disabled. I've looked at
the code for sun4/200 support of ECC memory sent to me by Chuck Cranor, but
it looks like a whole lot more work needs to be done to make that code
fit the sun3 framework. Perhaps in fact the sun3 framework should be
warped to match the sun4/200 framework. For example, Chuck's code does
special things to probe the memory boards during boot to determine their
size, starting address, etc.
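To illustrate the guess above, here is a toy model in C. It is NOT the
Sun-3 hardware and not code from memerr.c: it simply assumes that the
board keeps its CE condition latched until told otherwise, and that the
interrupt stays pending for as long as the latch is set. Every name in it
(toy_board, toy_memerr, toy_update, the two handlers, toy_run) is
hypothetical, and the bit values are assumed from the csr=d1 line in the
description.

```c
#include <assert.h>

/* Bit values assumed from the reported csr=d1<IPEND,IENA,CE_ENA,CE>. */
#define CSR_IPEND	0x80
#define CSR_IENA	0x40
#define CSR_CE_ENA	0x10
#define CSR_CE		0x01

struct toy_board { int ce_latched; };	/* error latch on the ECC board */
struct toy_memerr { unsigned csr; unsigned vaddr; };

/* Toy assumption: csr status bits simply mirror the board latch. */
static void
toy_update(struct toy_memerr *me, const struct toy_board *b)
{
	me->csr &= ~(unsigned)(CSR_CE | CSR_IPEND);
	if (b->ce_latched) {
		me->csr |= CSR_CE;
		if ((me->csr & CSR_IENA) && (me->csr & CSR_CE_ENA))
			me->csr |= CSR_IPEND;
	}
}

/* Mirrors the quoted recovery code: only the address register is written. */
static void
handler_vaddr_only(struct toy_memerr *me, struct toy_board *b)
{
	(void)b;
	me->vaddr = 0;
}

/* Hypothetical fix: also tell the board that the error was handled. */
static void
handler_clear_board(struct toy_memerr *me, struct toy_board *b)
{
	me->vaddr = 0;
	b->ce_latched = 0;
}

/* Dispatch the handler while the interrupt is pending, at most limit times. */
static int
toy_run(void (*handler)(struct toy_memerr *, struct toy_board *), int limit)
{
	struct toy_board b = { 1 };
	struct toy_memerr me = { CSR_IENA | CSR_CE_ENA, 0x0e5f8007 };
	int n = 0;

	assert(limit > 0);
	toy_update(&me, &b);
	while ((me.csr & CSR_IPEND) && n < limit) {
		handler(&me, &b);
		toy_update(&me, &b);
		n++;
	}
	return (n);
}
```

Under these assumptions toy_run(handler_vaddr_only, 1000) uses up the whole
limit, which corresponds to the endless loop reported here, while
toy_run(handler_clear_board, 1000) stops after a single dispatch. Whether
the real boards behave this way, and how they would have to be told, is
exactly the open question.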
>Audit-Trail:
>Unformatted: