Subject: parallel make locking up (on amd64)
To: None <current-users@netbsd.org>
From: Kurt Schreiner <ks@ub.uni-mainz.de>
List: port-amd64
Date: 07/12/2006 16:49:44
Hi,

while "torturing" my shiny new Sun Ultra 40 I tried some "build.sh -j7" runs, which
ran for a while, but eventually the make processes lock up WAITing on vnlock.
The lockup can (more or less) be reproduced by "reboot; login; build.sh -j7"...
Filesystems are set up as follows:

/dev/wd1g on /u type ffs (noatime, soft dependencies, local)
mfs:698 on /tmp type mfs (synchronous, nosuid, nodev, noatime, local)
<above>:/u/NetBSD/lsrc on /u/NetBSD/src.060711 type union (nosuid, nodev, local, mounted by ks)

parameters to build.sh are:

./build.sh -N 1 -j 7 -x -U -m amd64 -O /u/NetBSD/arch/amd64/obj \
 -D /u/NetBSD/arch/amd64/dest -T /u/NetBSD/arch/amd64/TOOLS

DDB (on serial console ;-) shows:

db{0}> ps
 PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
 8675             1     8675          0 2  0x4002    1            getty   ttyin
 3366          3573     7364         77 2  0x4002    1             less   ttyin
 3573          7364     7364         77 2  0x4002    1               sh    wait
 7364          3355     7364         77 2  0x4002    1              man    wait
 12189            1     7611         77 2  0x4002    1           nbmake  vnlock
 7935             1     7217         77 2  0x4002    1           nbmake  vnlock
 5841             1     3746         77 2  0x4002    1           nbmake  vnlock
 4683             1     3379         77 2  0x4002    1           nbmake  vnlock
 3095             1     1997         77 2  0x4002    1           nbmake  vnlock
 7294             1     5283         77 2  0x4002    1           nbmake  vnlock
 3355          2895     3355         77 2  0x4002    1             tcsh   pause
 2895          3517     3517         77 2   0x100    1             sshd  select
 3517           636     3517          0 2  0x4101    1             sshd   netio
 1207           918     1207         77 2  0x4002    1             tcsh   ttyin
 918            244      244         77 2   0x100    1             sshd  select
 244            636      244          0 2  0x4101    1             sshd   netio
 243              1      243          0 2  0x4002    1            getty   ttyin
 242              1      242          0 2  0x4002    1            getty   ttyin
 241              1      241          0 2  0x4002    1            getty   ttyin
 235              1      235          0 2       0    1             cron nanosle
 233              1      233          0 2       0    1            inetd  kqread
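
For anyone who wants to pull the stuck processes out of a longer "ps" capture from the serial console log, here is a minimal sketch in Python. It assumes the nine-column layout shown above (PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT); the helper name waiters() and the SAMPLE text are made up for illustration and are not part of any NetBSD tool.

```python
# Sketch only: filter DDB "ps" text for processes sleeping on one wait channel.
# Assumes the column layout PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT.

def waiters(ps_text, channel="vnlock"):
    """Return (pid, command) pairs whose WAIT column equals `channel`."""
    found = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Data rows have exactly nine fields and start with a numeric PID;
        # the header row and rows with an empty WAIT column are skipped.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        pid, command, wait = int(fields[0]), fields[7], fields[8]
        if wait == channel:
            found.append((pid, command))
    return found

# A few rows copied from the listing above, used as sample input.
SAMPLE = """\
 PID           PPID     PGRP        UID S   FLAGS LWPS          COMMAND    WAIT
 3573          7364     7364         77 2  0x4002    1               sh    wait
 5841             1     3746         77 2  0x4002    1           nbmake  vnlock
 7294             1     5283         77 2  0x4002    1           nbmake  vnlock
 3355          2895     3355         77 2  0x4002    1             tcsh   pause
"""

if __name__ == "__main__":
    for pid, command in waiters(SAMPLE):
        print(pid, command)
```

Each PID it returns can then be given to "trace/t 0t<pid>" in DDB, as done for two of them below.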


db{0}> trace/t 0t5841
trace: pid 5841  at 0xffff800057a326a0
ltsleep() at netbsd:ltsleep+0x3df
acquire() at netbsd:acquire+0x17d
lockmgr() at netbsd:lockmgr+0x367
VOP_LOCK() at netbsd:VOP_LOCK+0x25
vn_lock() at netbsd:vn_lock+0x99
cache_lookup() at netbsd:cache_lookup+0x2f9
ufs_lookup() at netbsd:ufs_lookup+0xdc
VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x27
union_lookup1() at netbsd:union_lookup1+0x42
union_lookup() at netbsd:union_lookup+0xd9
VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x27
lookup() at netbsd:lookup+0x296
namei() at netbsd:namei+0x16a
vn_open() at netbsd:vn_open+0x164
sys_open() at netbsd:sys_open+0xdd
syscall_plain() at netbsd:syscall_plain+0x122
kernel: page fault trap, code=0
Faulted in DDB; continuing...

db{0}> trace/t 0t7294
trace: pid 7294  at 0xffff8000581b9b60
ltsleep() at netbsd:ltsleep+0x3df
acquire() at netbsd:acquire+0x17d
lockmgr() at netbsd:lockmgr+0x680
VOP_LOCK() at netbsd:VOP_LOCK+0x25
vn_lock() at netbsd:vn_lock+0x99
union_lock() at netbsd:union_lock+0x7f
VOP_LOCK() at netbsd:VOP_LOCK+0x25
vn_lock() at netbsd:vn_lock+0x99
vn_readdir() at netbsd:vn_readdir+0xcb
sys___getdents30() at netbsd:sys___getdents30+0xaa
syscall_plain() at netbsd:syscall_plain+0x122
kernel: page fault trap, code=0
Faulted in DDB; continuing...

Is there anything I can do to help debug this? Should I file a PR with send-pr?

Kurt