Eduardo Horvath wrote:
On Mon, 23 Feb 2015, BERTRAND Joël wrote:

Bad news. Last night, this server panicked twice while idle with:

Feb 23 07:58:27 legendre /netbsd: cpu0: data fault: pc=f000934c rpc=103b435e0 addr=1ffee8000
Feb 23 07:58:27 legendre /netbsd: Skipping crash dump on recursive panic
Feb 23 07:58:27 legendre /netbsd: panic: kernel fault
Feb 23 07:58:27 legendre /netbsd: cpu0: Begin traceback...
Feb 23 07:58:27 legendre /netbsd: cpu0: End traceback...
Feb 23 07:58:27 legendre /netbsd: cpu1: shutting down
Feb 23 07:58:27 legendre /netbsd: cpu0: rebooting

Hm. From what I remember, f000xxxx is inside OBP. Instead of randomly swapping out hardware you really should try to diagnose the problem.
Julian Coleman sent me a private message. He suspects a bug in the UPA subsystem, so I replaced the UPA Creator3D with a PCI adapter.
I'd turn on ddb and traptrace in the kernel and examine the contents of the traptrace buffer after the panic. That should tell us the sequence of traps that caused the panic.
I'm trying, but it is not easy. I have several Blade 2000 machines, and this bug seems to be triggered by something on the PCI/UPA bus. At home, I have installed the same configuration (without the external U320 disks, but with the mpt adapter), and that server seems to be stable. I'm working on cpufreq there and haven't seen a panic or deadlock for a long time.
The faulty servers are far away and I have no serial console, only SSH access.
Regards, JKB