Subject: kernel stack overflow due to deep interrupt nesting
To: None <port-mips@netbsd.org, port-sgimips@netbsd.org>
From: Rafal Boni <rafal@attbi.com>
List: port-mips
Date: 04/05/2002 12:37:18
Folks:
	I've finally had a few minutes of quiet to chase down a problem I've
	been staring at on-and-off for a while: with a lot of output going 
	to the serial console & the console running at high speeds (38.4kbps
	in this case), my sgi kernels would generally fall over in several
	rather brutal ways (usually cache error panics or something else
	really non-intuitive).

	I finally tracked down the problem to a kernel stack overflow due to
	too deep interrupt nesting... Here's a backtrace (the panic is a
	check I added to make tracking this down a bit easier):


panic: cpu_intr: max_intr_depth too high: 16
Stopped at      0x8815ee64:     jr      ra
                bdslot: nop
db> tr
cpu_Debugger+4 (8ffff000,d,0,0) ra 88099290 sz 0
panic+124 (881ba894,36,0,6) ra 88193390 sz 40
cpu_intr+84 (881ba894,36,0,6) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,33) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,33) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,33) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,34) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,34) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,34) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,3a) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,3a) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,3a) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,39) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,39) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,39) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,30) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,30) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,30) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,32) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,32) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,32) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,32) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,32) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,32) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,62) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,62) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,62) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,65) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,65) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,65) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,46) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,46) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,46) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,17,bfa00000,20) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,17,bfa00000,20) ra 881933b0 sz 64
cpu_intr+a4 (fc01,17,bfa00000,20) ra 8815d79c sz 32
mips3_KernIntr+84 (fc01,1,bfa00000,4) ra 8818f7d8 sz 128
ip22_intr+1cc (fc01,1,bfa00000,4) ra 881933b0 sz 64
cpu_intr+a4 (fc01,1,bfa00000,4) ra 8815d79c sz 32
mips3_KernIntr+84 (ca554000,0,97ba,881f97a0) ra 8806928c sz 128
cpu_switch+64 (ca554000,0,97ba,881f97a0) ra 8808dabc sz 24
mi_switch+278 (ca554000,0,97ba,881f97a0) ra 8808d088 sz 48
ltsleep+244 (ca554000,0,97ba,881f97a0) ra 880d2320 sz 48
sched_sync+24c (ca554000,0,97ba,881f97a0) ra 8815de30 sz 104
mips3_proc_trampoline+8 (ca554000,0,97ba,881f97a0) ra 0 sz 0
User-level: curproc NULL
db> reboot 8
syncing disks... trap: TLB miss (load or instr. fetch) in kernel mode
status=0xff02, cause=0x8408, epc=0x880caa48, vaddr=0x0
curproc == NULL ksp=0xca554aa8
Stopped at      0x880caa48:     lw      v1,264(a0)

	The problem is that interrupts are generally turned back on inside
	the platform-specific IO interrupt handler, so the handler can itself
	be interrupted to service new interrupts, and that handler can be
	interrupted in turn, and so on.  This can probably also happen on
	any other MIPS port that uses a platform-specific IO interrupt
	handler, since many of them do the same thing.
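
	As a rough sketch of what each nesting level costs, going by the
	`sz' column in the trace above: each level adds one mips3_KernIntr,
	one ip22_intr and one cpu_intr frame.  Reading `sz' as the frame
	size in bytes is an assumption, and the helper names below are
	hypothetical, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Frame sizes copied from the "sz" column of the backtrace
 * (assumed to be bytes of stack per frame). */
enum {
    FRAME_KERNINTR  = 128,  /* mips3_KernIntr */
    FRAME_IP22INTR  = 64,   /* ip22_intr */
    FRAME_CPUINTR   = 32,   /* cpu_intr */
    BYTES_PER_LEVEL = FRAME_KERNINTR + FRAME_IP22INTR + FRAME_CPUINTR
};

/* Stack consumed by 'levels' nested interrupts (hypothetical helper). */
size_t nested_stack_bytes(unsigned levels)
{
    assert(BYTES_PER_LEVEL == 224);  /* 128 + 64 + 32 */
    return (size_t)levels * BYTES_PER_LEVEL;
}

/* Nesting levels that fit in 'budget' bytes of free kernel stack.
 * 'budget' is a caller-supplied guess, not a known kernel constant. */
unsigned max_levels(size_t budget)
{
    return (unsigned)(budget / BYTES_PER_LEVEL);
}
```

	Under that reading, nested_stack_bytes(16) comes to 3584 bytes for
	the interrupt frames alone at the depth where the check fired.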

	Note that the second and subsequent mips3_KernIntr invocations all
	come from the same address; that address (`0x8818f7d8') is the
	instruction after the call to:

	    _splset((status & ~cause & MIPS_HARD_INT_MASK) | MIPS_SR_INT_IE);

	at the end of ip22_intr() in sgimips/sgimips/ip22.c.
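
	To make the effect of that expression concrete, here is a small
	worked example.  The two constants follow the standard MIPS status
	register layout (IE in bit 0, hardware interrupt lines in bits
	10-15); splset_arg() is a hypothetical helper, 0xfc01 is the first
	argument shown in the trace (assumed here to be the status value),
	and the cause value used afterwards is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIPS_SR_INT_IE      0x00000001u  /* global interrupt enable */
#define MIPS_HARD_INT_MASK  0x0000fc00u  /* hardware interrupt lines 0-5 */

/* Value the expression hands to _splset(): every hardware line that was
 * enabled in 'status' and is not currently pending in 'cause' stays
 * enabled, and the global enable bit is forced on. */
uint32_t splset_arg(uint32_t status, uint32_t cause)
{
    uint32_t v = (status & ~cause & MIPS_HARD_INT_MASK) | MIPS_SR_INT_IE;

    assert(v & MIPS_SR_INT_IE);  /* interrupts are always re-enabled */
    return v;
}
```

	With status 0xfc01 and only hardware line 0 pending (cause 0x0400),
	this yields 0xf801: the pending line is masked, but the other five
	lines are open again while the handler is still on the stack.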

	I can think of several possible solutions, but none seem very
	good to me, so I thought I'd toss this out here and see if people
	have any better ideas.

	Potential `solutions' that I'm not too happy with include:

	  * Enlarging the size of the kernel stack in hopes of avoiding
	    this.  Not sure how deeply nested we can get, though, so I
	    don't know how many more pages we'd need to set up.

	  * Making the interrupt routine non-reentrant by not frobbing the
	    interrupt masks internally and hoping it gets taken care of by
	    the return-from-interrupt restoring the SR and interrupt masks.
	    This seems a little draconian.

	  * Frobbing the mips-generic interrupt code to loop while there are
	    pending interrupts, to avoid taking additional exceptions, and
	    only restoring the interrupt masks after exiting the loop.  This
	    is probably the least repulsive to me, but it needs to touch
	    generic code, which I'm loath to do this close to the 1.6
	    release being branched 8-/
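
	The difference between the current reentrant behaviour and the
	loop-while-pending idea in the last item can be sketched with a toy
	user-space model.  Nothing here is NetBSD code: `pending' stands in
	for the interrupt bits of the cause register, and all function names
	are hypothetical.

```c
#include <assert.h>

static unsigned pending;      /* simulated pending-interrupt bits */
static int depth, max_depth;  /* current and deepest handler nesting */
static int arrivals_left;     /* interrupts that have yet to arrive */

/* Model a device raising another interrupt while one is being serviced. */
static void device_activity(void)
{
    if (arrivals_left > 0) {
        arrivals_left--;
        pending |= 1u;
    }
}

/* Reentrant style: interrupts are re-enabled inside the handler, so a
 * newly arrived interrupt is taken at once and nests on the stack. */
void nested_intr(void)
{
    if (++depth > max_depth)
        max_depth = depth;
    pending &= ~1u;        /* service the current interrupt */
    device_activity();     /* more work shows up meanwhile */
    if (pending)
        nested_intr();     /* taken immediately: one level deeper */
    depth--;
}

/* Loop style: stay masked and keep servicing until nothing is pending,
 * restoring the masks only once, on the way out. */
void looped_intr(void)
{
    if (++depth > max_depth)
        max_depth = depth;
    while (pending) {
        pending &= ~1u;
        device_activity();
    }
    depth--;
}

/* Feed 'arrivals' back-to-back interrupts to 'handler' and return the
 * deepest nesting level reached. */
int simulate(void (*handler)(void), int arrivals)
{
    assert(arrivals > 0);
    depth = max_depth = 0;
    arrivals_left = arrivals - 1;  /* the first one is already pending */
    pending = 1u;
    handler();
    assert(depth == 0 && pending == 0);
    return max_depth;
}
```

	In this model simulate(nested_intr, 16) reaches depth 16, while
	simulate(looped_intr, 16) services the same 16 interrupts at depth 1.
	The model ignores interrupt priorities and latency, which is where
	the real trade-offs of the loop approach would show up.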

Any thoughts, ideas, etc. appreciated,
--rafal

----
Rafal Boni                                                     rafal@attbi.com
  We are all worms.  But I do believe I am a glowworm.  -- Winston Churchill