Subject: Other bugfixing before MP bugfixing?
To: None <port-sparc64@NetBSD.org>
From: Havard Eidnes <he@NetBSD.org>
List: port-sparc64
Date: 11/23/2005 19:30:21
Hi,
I recently set up a "crash-and-burn" box for sparc64 MP testing,
primarily for the purpose of seeing where the first problem hits.
Unsurprisingly, there are problems, and the kernel doesn't get
very far.
My test machine has:
NetBSD 3.99.11 (GENERIC.MP) #0: Sun Nov 20 12:34:05 CET 2005
he@quattro.urc.uninett.no:/u/build/HEAD/obj/sparc64/sys/arch/sparc64/compile/GENERIC.MP
total memory = 256 MB
avail memory = 239 MB
bootpath: /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
mainbus0 (root): SUNW,Ultra-2: hostid 8082f611
cpu0 at mainbus0: SUNW,UltraSPARC @ 199.988 MHz, version 0 FPU
cpu0: 32K instruction (32 b/l), 16K data (32 b/l), 1024K external (64 b/l)
cpu1 at mainbus0: SUNW,UltraSPARC @ 199.988 MHz, version 0 FPU
cpu1: 32K instruction (32 b/l), 16K data (32 b/l), 1024K external (64 b/l)
The problem strikes at:
root on sd0a dumps on sd0b
mountroot: trying lfs...
mountroot: trying ffs...
root file system type: ffs
cpu_args @ 0xfff8c000
ktext 1000000, ktextp 27800000, ektext 15a0000
kdata 1800000, kdatap 27400000, ekdata 18cc000
mp_start fff8c050, mp_start_size 0x50
cpu0: booting secondary processors:
node f0066518, cpuinfo 27e30000, initstack 0xe0020000
cpu1 now spinning idle (waited 1 iterations)
trap type 0x34: pc=13ab53c npc=13ab540 pstate=820006<PRIV,IE>
kernel trap 34: mem address not aligned
cpu0 paused.
Stopped in pid 1.1 (init) at netbsd:cc_microtime+0x19c: ldx [%fp + 0x7e7], %g1
db{1}>
db{1}> ps
 PID    PPID   PGRP   UID S   FLAGS LWPS COMMAND    WAIT
   6       0      0     0 2 0x20200    1 aiodoned   aiodone
   5       0      0     0 2 0x20200    1 ioflush    syncer
   4       0      0     0 2 0x20200    1 pagedaemon pgdaemo
   3       0      0     0 2 0x20200    1 scsibus0   sccomp
   2       0      0     0 2 0x20200    1 cryptoret  crypto_
  >1       0      0     0 2 0x20000    1 init
   0      -1      0     0 2 0x20200    1 swapper    schedul
db{1}>
I've been told that %fp == %o6, and the registers show:
db{1}> show reg
tstate 0x82000605
pc 0x13ab53c cc_microtime+0x19c
npc 0x13ab540 cc_microtime+0x1a0
ipl 0
y 0x2a
g0 0
g1 0
g2 0xffffffffffffffff
g3 0xb758000
g4 0
g5 0x181ac00 spinlock_list_slock+0x18
g6 0xffff0000
g7 0xfffeeff0
o0 0x1
o1 0
o2 0x38c
o3 0x1557e08 copyright+0x34610
o4 0x25800ad120024224
o5 0x800000004004004
o6 0xb6df2e1
o7 0x13ab530 cc_microtime+0x190
...
However, as far as I can see, %o6 + 0x7e7 (0xb6df2e1 + 0x7e7 =
0xb6dfac8) is evenly divisible by 8, so I think the diagnostic
spewed by the kernel must be misleading.
Further, disassembly of cc_microtime reveals that 0x19c is near
the end of the function, where we assign the value via the
pointer given as argument:
db{1}> x/i,20
netbsd:cc_microtime+0x174: stx %g3, [%fp + 0x7e7]
netbsd:cc_microtime+0x178: ldx [%fp + 0x7df], %g1
netbsd:cc_microtime+0x17c: or %l3, 0xe8, %o0
netbsd:cc_microtime+0x180: or %l2, 0x190, %o1
netbsd:cc_microtime+0x184: ldx [%fp + 0x7e7], %g2
netbsd:cc_microtime+0x188: or %g0, 0x9b, %o2
netbsd:cc_microtime+0x18c: stx %g1, [%g5 + 0x288]
netbsd:cc_microtime+0x190: call netbsd:_simple_unlock
netbsd:cc_microtime+0x194: stx %g2, [%g5 + 0x290]
netbsd:cc_microtime+0x198: wrpr %g0, %l1, %pil
netbsd:cc_microtime+0x19c: ldx [%fp + 0x7e7], %g1
netbsd:cc_microtime+0x1a0: ldx [%fp + 0x7df], %g2
netbsd:cc_microtime+0x1a4: stx %g1, [%i0 + 0x8]
netbsd:cc_microtime+0x1a8: stx %g2, [%i0 + %g0]
netbsd:cc_microtime+0x1ac: return [%i7 + 0x8]
netbsd:cc_microtime+0x1b0: nop
The corresponding C code appears to be:
        if (sec == 0 && usec > 0) {
                t.tv_usec += usec + 1;
                if (t.tv_usec >= 1000000) {
                        t.tv_usec -= 1000000;
                        t.tv_sec++;
                }
        }
        lasttime = t;
0x190   simple_unlock(&microtime_slock);
0x198   splx(s);
0x19c-0x1a8     *tvp = t;
}
You will also note that [%fp + 0x7e7] has been stored into just a
few instructions before the faulting instruction -- further
indication that the trap diagnostic from the kernel must be
misleading.
The sparc64 port's ddb also does not appear to have the ability
to look at the registers for the other CPU -- "machine cpu" does
not take an argument, and only shows information about the CPUs:
db{1}> machine cpu 0
cpu0: self 0x01c14000 lwp 0x00000000 pcb 0x0b6dc000
cpu1: self 0x0b758000 lwp 0x0b795e00 pcb 0x0b6dc000
db{1}>
db{1}> machine cpu
cpu0: self 0x01c14000 lwp 0x00000000 pcb 0x0b6dc000
cpu1: self 0x0b758000 lwp 0x0b795e00 pcb 0x0b6dc000
db{1}>
...and I don't know how to make use of that information to see if
there really was a problem flagged on CPU 0.
So... Am I right in guessing that there are a few other problems
which need to be solved before we can make any progress on the
actual MP problems?
Regards,
- Håvard