Eduardo Horvath wrote:
> On Mon, 13 Apr 2015, BERTRAND Joël wrote:
>
> > I have seen. And I have seen another panic:
> >
> > panic: cpu1: ipi_send: couldn't send ipi to UPAID 0 (tried 10000 times)
> > cpu1: Begin traceback...
> > cpu1: End traceback...
> > Frame pointer is at 0x2004e41
> > Call traceback:
> > netbsd:cpu_reboot+0x208(182f828, 1, ffff, 77bb78, 1cce380, 1c97000) fp = 2004f01
> > netbsd:vpanic+0x178(104, 0, 1852638, 1cb6800, f, 1c70740) fp = 2004fb1
> > netbsd:panic+0x24(1852638, 20059a8, 1cdc800, 1cddaf8, 1cddc00, 104) fp = 2005061
> > netbsd:sparc64_send_ipi_sun4u+0x1ac(1852638, 1, 0, 2710, fffffffffffffffe, 0) fp = 2005121
> > netbsd:cpu_need_resched+0x54(f4240, 1018a80, 0, 0, 70, 0) fp = 20051d1
> > netbsd:sched_changepri+0x64(2014000, 2, 2014000, 101db1d08, 101db1040, 2a) fp = 2005281
> > netbsd:resetpriority+0x90(1043816c0, 2a, 0, 1, 101daec40, 101daedc0) fp = 2005331
> > netbsd:sched_pstats+0x118(1043816c0, 0, 1c70868, 0, 10caf5510, 2a) fp = 20053e1
> > netbsd:uvm_scheduler+0x60(64, 1c71000, 0, 101daedc0, 10caf5510, 1043816c0) fp = 2005491
> > netbsd:main+0x83c(101d89f00, 1c70740, 1c70740, 101da2c80, 1c0a1fc, 18a0598) fp = 2005541
> > netbsd:cpu_initialize+0x154(184d500, 10624dd3, 1c97800, 0, 101daee00, 1) fp = 2005621
> > netbsd:100030+0(f0059840, 113800, 113c00, 111880, 111ce8, 1117f8) fp = fff33651
> > dumping to dev 25,1 offset 12291071
> >
> > But I don't understand. With the same kernel, this Blade2000 rebooted one or more times _per day_, and now the uptime is greater than 8 days. I have saved the kernel image and core if you want them.
>
> Well that's not terribly useful. One CPU tried to tell another CPU something but the other CPU did not respond. It then panicked. In this circumstance the interesting info is the state of the unresponsive CPU. An SIR would be much more useful in this circumstance than a panic.

Hello,

Some good news. Before patching locore.s with your suggestions, I rebuilt a 7.99.9 kernel from sources (with userland), and I had planned to investigate last Saturday. This 7.99.9 kernel is stable on my Blade 2000: I got an uptime of more than 6 days (the system finally crashed when I tried to run /etc/rc.d/altqd restart, but that is not the same issue). With 7.99.6, under the same conditions, the same Blade 2000 panicked once or twice a day. I have not seen any change in sparc64/sparc64 or sparc64/dev that would explain why 7.99.9 is stable and 7.99.6 was not.

So I rebuilt 7.99.12 from sources, and tda.c seems to be broken. In dmesg, tda.c writes:

tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values
tda0: skipping temp adjustment - no sensor values

and envstat only returns:

envstat: no drivers registered

but the fans do not run at maximum speed.

Best regards,

JKB