NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/59037: deadlock in posix_spawn
>Number: 59037
>Category: kern
>Synopsis: deadlock in posix_spawn
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jan 29 16:15:00 +0000 2025
>Originator: Taylor R Campbell
>Release: NetBSD 10.0_STABLE
>Organization:
The NetPOSIXD Spawndation
>Environment:
Architecture: x86_64
Machine: amd64
NetBSD 10.0_STABLE (SHADOW) #0: Mon Oct 28 06:14:44 UTC 2024 spz%franklin.NetBSD.org@localhost:/home/netbsd/10/amd64/obj/sys/arch/amd64/compile/SHADOW
>Description:
Various processes stuck in tstile, vfork, wait, pipe_rd,
select, poll, &c.:
db{0}> ps /l
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
19214 19214 3 22 0 ffffd919f99c1a40 screen tstile
15719 15719 3 0 0 ffffd9171a2fa280 cron tstile
16226 16226 3 0 0 ffffd912be5c4bc0 cron vfork
23194 23194 3 0 0 ffffd9a699e51040 cron tstile
9594 9594 3 0 0 ffffd91219955a80 cron vfork
...
23011 23011 3 9 180 ffffd90b14a8b600 sh pipe_rd
8879 8879 3 16 180 ffffd908c7a44040 sh pipe_rd
20035 20035 3 9 0 ffffd918a06a3700 sh tstile
27233 27233 3 13 0 ffffd9de7d301040 sh tstile
23703 23703 3 22 180 ffffd903d350e780 sh wait
7301 7301 3 11 0 ffffd90860df9500 gcc vfork
...
These ones look interesting:
4075 4075 3 5 0 ffffd903f5336080 winebuild tstile
18591 18591 3 1 180 ffffd9884718ea80 sh wait
27904 27904 3 22 180 ffffd903d8719340 sh wait
28380 28380 3 19 180 ffffd90536102780 winebuild wait
The threads waiting in tstile are waiting on exec_lock, except
for the winebuild child process:
db{0}> show all tstiles
PID LID COMMAND WAITING-FOR WAIT-CHANNEL
19214 19214 screen 0 netbsd:exec_lock
15719 15719 cron 0 netbsd:exec_lock
23194 23194 cron 0 netbsd:exec_lock
...
23662 23662 R 0 netbsd:exec_lock
13798 13798 ghc 0 netbsd:exec_lock
9896 9896 ghc 0 netbsd:exec_lock
6801 6801 ghc 0 netbsd:exec_lock
466 466 ghc 0 netbsd:exec_lock
2432 2432 ghc 0 netbsd:exec_lock
23789 23789 ghc 0 netbsd:exec_lock
4075 4075 winebuild 0 ffffd903f7ac6a90
The winebuild child process appears to be waiting for
rw_enter(&p->p_reflock, RW_WRITER):
db{0}> bt/t 0t4075
trace: pid 4075 lid 4075 at 0xffff9284f9ca8df0
sleepq_block() at netbsd:sleepq_block+0x13a
turnstile_block() at netbsd:turnstile_block+0x3d2
rw_enter() at netbsd:rw_enter+0x18b
spawn_return() at netbsd:spawn_return+0x67
(gdb) print 0x67
$7 = 103
(gdb) disas spawn_return+103
...
0xffffffff80ce3018 <+98>: callq 0xffffffff8023fbc0 <rw_enter>
0xffffffff80ce301d <+103>: mov %gs:0xac0,%rax
...
2356 rw_enter(&p->p_reflock, RW_WRITER);
https://nxr.netbsd.org/xref/src/sys/kern/kern_exec.c?r=1.518.4.1#2356
The lock it's waiting for appears to be its own p_reflock:
(gdb) print (long)&((struct proc *)0)->p_pid - (long)&((struct proc *)0)->p_reflock
$8 = -612
db{0}> x/ld ffffd903f7ac6a90-0t612
ffffd903f7ac682c: 4075
It is held by a single reader with a writer waiting -- 0x23 is:
(1 << RW_READ_COUNT_SHIFT=5)
| RW_HAS_WAITERS=1
| RW_WRITE_WANTED=2
db{0}> x/Lx ffffd903f7ac6a90
ffffd903f7ac6a90: 23
The winebuild parent process is waiting for the child:
db{0}> bt/t 0t28380
trace: pid 28380 lid 28380 at 0xffff92867f69cc10
sleepq_block() at netbsd:sleepq_block+0x13a
cv_wait_sig() at netbsd:cv_wait_sig+0xbd
do_sys_waitid() at netbsd:do_sys_waitid+0x1ac
do_sys_wait() at netbsd:do_sys_wait+0x90
sys___wait450() at netbsd:sys___wait450+0x42
syscall() at netbsd:syscall+0x211
--- syscall (number 449) ---
netbsd:syscall+0x211:
In this path, though, I don't think it can hold exec_lock or
the child's p_reflock.
exec_lock is also held by a single reader, with writers
waiting:
db{0}> x/Lx exec_lock
netbsd:exec_lock: 23
Not sure who holds it. It could be the winebuild child that
holds exec_lock, via this sketchy path:
2345 struct posix_spawnattr *attrs = spawn_data->sed_attrs;
2346 if ((!attrs || (attrs->sa_flags
2347 & (POSIX_SPAWN_RETURNERROR|POSIX_SPAWN_SETPGROUP)) == 0)
2348 && rw_tryenter(&exec_lock, RW_READER)) {
2349 parent_is_waiting = false;
2350 mutex_enter(&spawn_data->sed_mtx_child);
2351 cv_signal(&spawn_data->sed_cv_child_ready);
2352 mutex_exit(&spawn_data->sed_mtx_child);
2353 }
https://nxr.netbsd.org/xref/src/sys/kern/kern_exec.c?r=1.518.4.1#2345
But it could also be any other process in the middle of
execve_loadvm.
The culprit can't fixed by kern_exec.c rev. 1.522, pullup-10
#1000, which added a missing rw_exit(&exec_lock) -- the path it
added that to takes a write lock, not a read lock.
It is interesting that there is a writer waiting. Not many
paths take a write lock on exec_lock. Looks like this might be
the R process:
db{0}> bt/t 0t23662
trace: pid 23662 lid 23662 at 0xffff9284efb0bc60
sleepq_block() at netbsd:sleepq_block+0x13a
turnstile_block() at netbsd:turnstile_block+0x3d2
rw_enter() at netbsd:rw_enter+0x18b
sigexit() at netbsd:sigexit+0x320
postsig() at netbsd:postsig+0x2c3
lwp_userret() at netbsd:lwp_userret+0x182
syscall() at netbsd:syscall+0x272
--- syscall (number 23662 via SYS_syscall) ---
netbsd:syscall+0x272:
(gdb) info line *(sigexit+0x320)
Line 2338 of "/home/riastradh/netbsd/10/src/sys/kern/kern_sig.c"
starts at address 0xffffffff80d0680f <sigexit+800>
and ends at 0xffffffff80d06826 <sigexit+823>.
(gdb) print 0x320
$9 = 800
(gdb) disas sigexit+800
...
0xffffffff80d0680a <+795>: callq 0xffffffff8023fbc0 <rw_enter>
0xffffffff80d0680f <+800>: mov 0x190(%rbx),%rsi
2336 #ifdef PAX_SEGVGUARD
2337 rw_enter(&exec_lock, RW_WRITER);
2338 pax_segvguard(l, p->p_textvp, p->p_comm, true);
2339 rw_exit(&exec_lock);
2340 #endif /* PAX_SEGVGUARD */
https://nxr.netbsd.org/xref/src/sys/kern/kern_sig.c?r=1.404#2336
That is, it appears the R process is crashing and taking a core
dump. Unfortunately, I see no reason for it to be related to
the winebuild process.
I skimmed the stack traces of various other processes but I
didn't find anything obvious.
>How-To-Repeat:
run bulk builds repeatedly (maybe especially wine)
>Fix:
Yes, please!
Home |
Main Index |
Thread Index |
Old Index