Re: Ultrasparc III+ kernel panic

To: BERTRAND Joël <joel.bertrand%systella.fr@localhost>
Subject: Re: Ultrasparc III+ kernel panic
From: Eduardo Horvath <eeh%NetBSD.org@localhost>
Date: Wed, 1 Apr 2015 15:57:16 +0000 (UTC)

On Wed, 1 Apr 2015, BERTRAND Joël wrote:

> 	Hello,
> 
> 	New panic last night...
> 
> 1 tt=30 tstate=4411001505 tpc=0x1001488 tnpc=0x100148c
> 2 tt=30 tstate=4482000603 tpc=0x12e1da0 tnpc=0x12e1da4
> 
> Debug information :
> (gdb) list *(0x1001488)
> (gdb) x/i 0x1001488
>    0x1001488 <uspillk4+8>:      sta  %l0, [ %sp ] %asi
> (gdb) list *(0x12e1da0)
> 0x12e1da0 is in mutex_vector_enter (/usr/src/sys/kern/kern_mutex.c:440).
> 435      *      fast-path stubs are available.  If an mutex_spin_enter() stub is
> 436      *      not available, then it is also aliased directly here.
> 437      */
> 438     void
> 439     mutex_vector_enter(kmutex_t *mtx)
> 440     {
> 441             uintptr_t owner, curthread;
> 442             turnstile_t *ts;
> 443     #ifdef MULTIPROCESSOR
> 444             u_int count;
> (gdb) x/i 0x12e1da0
>    0x12e1da0 <mutex_vector_enter>:      save  %sp, -176, %sp
> 
> mach stack does not return usable information. Only :
> db{0} > mach stack
> Window 0 frame64 0xe004ff50 locals, ins:
> 10426baa0 0 15a068000 1044914d0 fffffffffefa2000 0 102cfafd0 180f680
> 0 0 0 0 0 0 ffffffffffffa011=sp fffffffffed6d200=pc:fffffffffed6d200
> Window 1 frame64 0xffffffffffffa810 locals, ins:
> 
> 	You can see that this panic is exactly the same as the last panic.

I looked at the archives and it doesn't look like I commented on this 
previously.

I'm assuming the trap stack is semi-accurate.  The save instruction should 
not be able to generate a data access fault, but then the low level bits 
of locore.s do some interesting gymnastics with the trap stack to prevent 
loss of data, so it may have moved things around.

uspillk4 is used to save alternate space register windows to the stack.  
The order of operations is:

1) The CPU is running userland code and traps into the kernel.  

2) The kernel switches to the kernel stack and moves the contents of 
%canrestore to %otherwin to mark those register windows as belonging to 
another address space.

3) The kernel does some stuff and eventually calls mutex_vector_enter().

4) mutex_vector_enter() needs a new register window, so it does a save.

5) The register windows are full, so the CPU takes a window spill trap.  
Since %otherwin is not zero, it goes to uspillk4 to save the other address 
space's windows instead of kspill4.

6) The trap handler tries to save the window and takes a data fault.

7) The data fault handler punts.

What should happen is:

The CPU takes a save fault at trap level 1.

It takes a data fault at trap level 2.

The data fault handler jumps to winfault.  winfault will look at the 
current trap level.  Since it's not 1, it executes some fancy code to 
fiddle with the trap stack and figure out what's really happening.  It 
should detect a fault during a spill and go to winfixspill.

The winfixspill code should save all the otherwin windows to slots in the 
PCB, and then continue executing kernel code.

Eventually, when returning to userland, the trap return code will restore 
all the userland windows from the PCB and return to userland code.

winfix has a bunch of diagnostic code still enabled.  You do not seem to 
be hitting any of the sir instructions sprinkled in the code that would 
reset the box.  

There is also more debug and diagnostic code in there that is not 
enabled.  You might want to try turning some of the NOT_DEBUG or 
NOTDEF_DEBUG code on.

Also, look for calls to panic.  At line 2149 there's a ta 1, which will 
cause a trap, before the call to panic.  That made sense when the kernel 
still had traptrace, since it would generate a traptrace entry before all 
hell broke loose.  Now it probably just makes things worse.  Try removing 
it so the code really calls panic there, or change it to an sir 
instruction to generate a reset.

There's another ta 1 on line 2306 to trap to the debugger.  Since trapping 
to ddb is not reliable in this situation, change it to an sir instruction.

Anyway, you probably need to instrument that code path to see where it's 
getting confused.

And keep in mind that the code is semi-recursive, in that you can take a 
data fault while trying to clean up state in order to handle a data fault.

Eduardo
