port-sparc64: Other bugfixing before MP bugfixing?

Subject: Other bugfixing before MP bugfixing?
To: None <port-sparc64@NetBSD.org>
From: Havard Eidnes <he@NetBSD.org>
List: port-sparc64
Date: 11/23/2005 19:30:21

Hi,

I recently set up a "crash-and-burn" box for sparc64 MP testing,
primarily for the purpose of seeing where the first problem hits.

Unsurprisingly, there are problems, and the kernel doesn't get
very far.


My test machine has:

NetBSD 3.99.11 (GENERIC.MP) #0: Sun Nov 20 12:34:05 CET 2005
        he@quattro.urc.uninett.no:/u/build/HEAD/obj/sparc64/sys/arch/sparc64/compile/GENERIC.MP
total memory = 256 MB
avail memory = 239 MB
bootpath: /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
mainbus0 (root): SUNW,Ultra-2: hostid 8082f611
cpu0 at mainbus0: SUNW,UltraSPARC @ 199.988 MHz, version 0 FPU
cpu0: 32K instruction (32 b/l), 16K data (32 b/l), 1024K external (64 b/l)
cpu1 at mainbus0: SUNW,UltraSPARC @ 199.988 MHz, version 0 FPU
cpu1: 32K instruction (32 b/l), 16K data (32 b/l), 1024K external (64 b/l)


The problem strikes at:

root on sd0a dumps on sd0b
mountroot: trying lfs...
mountroot: trying ffs...
root file system type: ffs
cpu_args @ 0xfff8c000
ktext 1000000, ktextp 27800000, ektext 15a0000
kdata 1800000, kdatap 27400000, ekdata 18cc000
mp_start fff8c050, mp_start_size 0x50
cpu0: booting secondary processors:
node f0066518, cpuinfo 27e30000, initstack 0xe0020000
cpu1 now spinning idle (waited 1 iterations)

trap type 0x34: pc=13ab53c npc=13ab540 pstate=820006<PRIV,IE>
kernel trap 34: mem address not aligned
cpu0 paused.
Stopped in pid 1.1 (init) at netbsd:cc_microtime+0x19c:  ldx [%fp + 0x7e7], %g1
db{1}>
db{1}> ps
 PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
 6                0        0          0 2 0x20200    1         aiodoned aiodone
 5                0        0          0 2 0x20200    1          ioflush  syncer
 4                0        0          0 2 0x20200    1       pagedaemon pgdaemo
 3                0        0          0 2 0x20200    1         scsibus0  sccomp
 2                0        0          0 2 0x20200    1        cryptoret crypto_
>1                0        0          0 2 0x20000    1             init
 0               -1        0          0 2 0x20200    1          swapper schedul
db{1}>

I've been told that %fp == %o6, and the registers show:

db{1}> show reg
tstate      0x82000605
pc          0x13ab53c   cc_microtime+0x19c
npc         0x13ab540   cc_microtime+0x1a0
ipl         0
y           0x2a
g0          0
g1          0
g2          0xffffffffffffffff
g3          0xb758000
g4          0
g5          0x181ac00   spinlock_list_slock+0x18
g6          0xffff0000
g7          0xfffeeff0
o0          0x1
o1          0
o2          0x38c
o3          0x1557e08   copyright+0x34610
o4          0x25800ad120024224
o5          0x800000004004004
o6          0xb6df2e1
o7          0x13ab530   cc_microtime+0x190
...

However, %o6 + 0x7e7 (0xb6df2e1 + 0x7e7 = 0xb6dfac8) is, as far
as I can see, evenly divisible by 8, so I think the diagnostic
spewed by the kernel must be misleading.

Further, disassembly of cc_microtime reveals that 0x19c is near
the end of the function, where we assign the value via the
pointer given as argument:

db{1}> x/i,20
netbsd:cc_microtime+0x174:      stx             %g3, [%fp + 0x7e7]
netbsd:cc_microtime+0x178:      ldx             [%fp + 0x7df], %g1
netbsd:cc_microtime+0x17c:      or              %l3, 0xe8, %o0
netbsd:cc_microtime+0x180:      or              %l2, 0x190, %o1
netbsd:cc_microtime+0x184:      ldx             [%fp + 0x7e7], %g2
netbsd:cc_microtime+0x188:      or              %g0, 0x9b, %o2
netbsd:cc_microtime+0x18c:      stx             %g1, [%g5 + 0x288]
netbsd:cc_microtime+0x190:      call            netbsd:_simple_unlock
netbsd:cc_microtime+0x194:      stx             %g2, [%g5 + 0x290]
netbsd:cc_microtime+0x198:      wrpr            %g0, %l1, %pil
netbsd:cc_microtime+0x19c:      ldx             [%fp + 0x7e7], %g1
netbsd:cc_microtime+0x1a0:      ldx             [%fp + 0x7df], %g2
netbsd:cc_microtime+0x1a4:      stx             %g1, [%i0 + 0x8]
netbsd:cc_microtime+0x1a8:      stx             %g2, [%i0 + %g0]
netbsd:cc_microtime+0x1ac:      return          [%i7 + 0x8]
netbsd:cc_microtime+0x1b0:      nop

The corresponding C code appears to be:

		   if (sec == 0 && usec > 0)  {
			   t.tv_usec += usec + 1;
			   if (t.tv_usec >= 1000000) {
				   t.tv_usec -= 1000000;
				   t.tv_sec++;
			   }
		   }
                   lasttime = t;
0x190              simple_unlock(&microtime_slock);

0x198              splx(s);

0x19c-0x1a8        *tvp = t;
}

You will also note that [%fp + 0x7e7] has been stored into just a
few instructions before the faulting instruction -- further
indication that the trap diagnostic from the kernel must be
misleading.

The sparc64 port's ddb also does not appear to be able to show
the registers of the other CPU -- "machine cpu" does not take an
argument, and only lists information about the CPUs:

db{1}> machine cpu 0
cpu0: self 0x01c14000 lwp 0x00000000 pcb 0x0b6dc000
cpu1: self 0x0b758000 lwp 0x0b795e00 pcb 0x0b6dc000
db{1}>
db{1}> machine cpu
cpu0: self 0x01c14000 lwp 0x00000000 pcb 0x0b6dc000
cpu1: self 0x0b758000 lwp 0x0b795e00 pcb 0x0b6dc000
db{1}>

...and I don't know how to make use of that information to see if
there really was a problem flagged on CPU 0.

So... Am I right in guessing that there are a few other problems
which need to be solved before we can make any progress on the
actual MP problems?

Regards,

- Håvard