kern/59376: wg(4) workqueue caught in a cycle of endless work
>Number: 59376
>Category: kern
>Synopsis: wg(4) workqueue caught in a cycle of endless work
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Apr 29 13:40:01 +0000 2025
>Originator: Taylor R Campbell
>Release: 10.1ish
>Organization:
Work Till You Drop BSD LLC GmbH SPQRump
>Environment:
>Description:
Found a wg(4) rump_server stuck in a loop. On investigation in
gdb, I discovered this:
(gdb) thread 3
[Switching to thread 3 (LWP 1145 of process 23633)]
#0 mtxenter (mtx=0x72b6e36ddf00)
at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:177
177 /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c: No such file or directory.
(gdb) bt
#0 mtxenter (mtx=0x72b6e36ddf00)
at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:177
#1 rumpuser_mutex_enter (mtx=0x72b6e36ddf00)
at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:203
#2 0x000072b6e000e20b in wg_peer_work (wk=0x72b6e2f84c38,
cookie=<optimized out>)
at /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c:3565
#3 0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
list=0x72b6d3feff60, wq=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
#4 workqueue_worker (cookie=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:184
#5 0x000072b6e3cd597f in threadbouncer (arg=0x72b6e33adc40)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/threads.c:90
#6 0x000072b6e2c0c89f in pthread__create_tramp (cookie=0x72b6e3258c00)
at /home/riastradh/netbsd/10/src/lib/libpthread/pthread.c:595
#7 0x000072b6e26930e0 in ?? () from /usr/lib/libc.so.12
#8 0x0000000000000000 in ?? ()
(gdb) fr 2
#2 0x000072b6e000e20b in wg_peer_work (wk=0x72b6e2f84c38,
cookie=<optimized out>)
at /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c:3565
3565 /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c: No such file or directory.
(gdb) print *wk
$1 = {wk_dummy = 0x72b6e2f84c38}
(gdb) print *(struct work_impl *)wk
$2 = {wk_entry = {sqe_next = 0x72b6e2f84c38}}
...
(gdb) fr 3
#3 0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
list=0x72b6d3feff60, wq=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
148 /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c: No such file or directory.
(gdb) info locals
wk = <optimized out>
next = 0x72b6e2f84c38
(gdb) print *next
$6 = {wk_entry = {sqe_next = 0x72b6e2f84c38}}
(gdb) print next == next->wk_entry.sqe_next
$7 = 1
That is, somehow, the struct work has been queued into a cycle of
endless work in the rat race. Needless to say, this is not
intentional; it ought to have a break for weekends, at least.
It's conceivable, from this sample, that the work has simply
been requeued by an interrupt or another thread shortly after
it was taken off the queue, which is allowed the moment the
workqueue's function is called. In the case of wg(4), the
workqueue function unblocks further requeueing under
wgp_intr_lock:
3565 mutex_enter(wgp->wgp_intr_lock);
3566 while ((tasks = wgp->wgp_tasks) != 0) {
3567 wgp->wgp_tasks = 0;
3568 mutex_exit(wgp->wgp_intr_lock);
https://nxr.netbsd.org/xref/src/sys/net/if_wg.c?r=1.71.2.5#3565
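For context, the remainder of the drain loop has roughly the
following shape (an abridged sketch, not the verbatim source; see
the nxr link above for the real code, and the task dispatch body
is elided here):

	mutex_enter(wgp->wgp_intr_lock);
	while ((tasks = wgp->wgp_tasks) != 0) {
		wgp->wgp_tasks = 0;
		mutex_exit(wgp->wgp_intr_lock);

		/*
		 * ...dispatch on the bits in tasks...
		 *
		 * From the moment wgp_tasks was zeroed and
		 * wgp_intr_lock dropped, another thread may see
		 * tasks == 0 and legitimately enqueue the work
		 * again, even though this invocation of
		 * wg_peer_work has not returned yet.
		 */

		mutex_enter(wgp->wgp_intr_lock);
	}
	mutex_exit(wgp->wgp_intr_lock);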
The corresponding queueing side is:
2668 mutex_enter(wgp->wgp_intr_lock);
2669 WG_DLOG("tasks=%d, task=%d\n", wgp->wgp_tasks, task);
2670 if (wgp->wgp_tasks == 0)
2671 /*
2672 * XXX If the current CPU is already loaded -- e.g., if
2673 * there's already a bunch of handshakes queued up --
2674 * consider tossing this over to another CPU to
2675 * distribute the load.
2676 */
2677 workqueue_enqueue(wg_wq, &wgp->wgp_work, NULL);
https://nxr.netbsd.org/xref/src/sys/net/if_wg.c?r=1.71.2.5#2668
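As an illustration (a hypothesis, not something this sample
proves): the self-link observed above is exactly what the standard
sys/queue.h SIMPLEQ_INSERT_TAIL macro produces if the same struct
work is ever enqueued a second time while it is already the tail
of the queue. A minimal standalone sketch, compiled against
NetBSD's <sys/queue.h>:

	#include <sys/queue.h>

	#include <stdio.h>

	struct work_impl {
		SIMPLEQ_ENTRY(work_impl) wk_entry;
	};

	SIMPLEQ_HEAD(workqhead, work_impl);

	int
	main(void)
	{
		struct workqhead q = SIMPLEQ_HEAD_INITIALIZER(q);
		struct work_impl wk;

		SIMPLEQ_INSERT_TAIL(&q, &wk, wk_entry);	/* legitimate enqueue */
		SIMPLEQ_INSERT_TAIL(&q, &wk, wk_entry);	/* double enqueue */

		/* prints 1: wk now links to itself */
		printf("%d\n", &wk == SIMPLEQ_NEXT(&wk, wk_entry));
		return 0;
	}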
But I took another sample in gdb at a slightly different time
inside workqueue_runlist, and it came out as:
(gdb) bt
#0 mutex_exit (mtx=0x72b6e42ea880)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/locks.c:195
#1 0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
list=0x72b6d3feff60, wq=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
#2 workqueue_worker (cookie=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:184
#3 0x000072b6e3cd597f in threadbouncer (arg=0x72b6e33adc40)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/threads.c:90
#4 0x000072b6e2c0c89f in pthread__create_tramp (cookie=0x72b6e3258c00)
at /home/riastradh/netbsd/10/src/lib/libpthread/pthread.c:595
#5 0x000072b6e26930e0 in ?? () from /usr/lib/libc.so.12
#6 0x0000000000000000 in ?? ()
That is, we're in the tail call to mutex_exit in wg_peer_work,
which will return here:
144 for (wk = SIMPLEQ_FIRST(list); wk != NULL; wk = next) {
145 next = SIMPLEQ_NEXT(wk, wk_entry);
146 SDT_PROBE4(sdt, kernel, workqueue, entry,
147 wq, wk, wq->wq_func, wq->wq_arg);
=> 148 (*wq->wq_func)((void *)wk, wq->wq_arg);
149 SDT_PROBE4(sdt, kernel, workqueue, return,
150 wq, wk, wq->wq_func, wq->wq_arg);
151 }
https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c?r=1.41.2.1#144
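To see why the self-link makes this loop endless: next always
equals wk, so the loop condition never sees NULL and the worker
keeps calling wq_func on the same work forever. A minimal
standalone sketch of the same loop shape (iteration capped so the
demo terminates; the wq_func call is elided):

	#include <stdio.h>

	struct work_impl {
		struct work_impl *sqe_next;	/* the SIMPLEQ link */
	};

	int
	main(void)
	{
		struct work_impl wk = { .sqe_next = &wk };	/* observed state */
		struct work_impl *p, *next = NULL;
		int n = 0;

		for (p = &wk; p != NULL; p = next) {
			next = p->sqe_next;	/* SIMPLEQ_NEXT(wk, wk_entry) */
			/* (*wq->wq_func)((void *)p, wq->wq_arg) would run here */
			if (++n == 3)
				break;		/* cap for the demo */
		}
		printf("%d iterations in, still p == next: %d\n", n, p == next);
		return 0;
	}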
At this point, if the work has been legitimately queued again,
there must be some wg task associated with it. But review of
the wg peer shows that is not the case:
(gdb) fr 1
#1 0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
list=0x72b6d3feff60, wq=0x72b6e36eda80)
at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
148 /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c: No such file or directory.
(gdb) info locals
wk = <optimized out>
next = 0x72b6e2f84c38
(gdb) print (struct wg_peer *)((char *)next - (char *)&((struct wg_peer *)0)->wgp_work)
$3 = (struct wg_peer *) 0x72b6e2f83000
(gdb) print ((struct wg_peer *)0x72b6e2f83000)->wgp_tasks
$4 = 0
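The pointer arithmetic above is the usual offsetof/container_of
idiom: recover the enclosing struct wg_peer from a pointer to its
embedded wgp_work member. A compilable sketch, with struct wg_peer
abridged to the two members used here and a hypothetical helper
name:

	#include <stddef.h>

	struct work {
		void *wk_dummy;
	};

	struct wg_peer {
		/* ...many members elided... */
		struct work wgp_work;
		unsigned int wgp_tasks;
		/* ... */
	};

	/* hypothetical helper; the kernel spells this container_of() */
	static inline struct wg_peer *
	work_to_wg_peer(struct work *wk)
	{
		return (struct wg_peer *)((char *)wk -
		    offsetof(struct wg_peer, wgp_work));
	}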
To confirm this hypothesis, I set a breakpoint on
wg_schedule_peer_task and continued, and it didn't fire:
(gdb) break wg_schedule_peer_task
Breakpoint 1 at 0x72b6e0006a9c: file /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c, line 2666.
(gdb) c
Continuing.
OK, gdb just wedged and I dunno what it's up to now:
^C[ 1327882.5248464] load: 0.78 cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c23877f62c/2] 1.55u 1.15s 6% 57888k
^C[ 1327885.6808894] load: 0.96 cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c2386475da/2] 2.64u 3.15s 19% 57888k
^\[ 1327896.0450304] load: 1.12 cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c2386475da/1] 6.39u 9.52s 52% 57888k
[ 1327921.7293803] sorry, pid 23633 was killed: orphaned traced process
[1] Killed chroot /home/riastradh/netbsd/10/obj.amd64/destdir.amd64 gdb rump_server 23633
...aaaaaand I killed gdb and now wg(4) works again.
Out of time for further diagnostics now.
>How-To-Repeat:
run wg-userspace for weeks or months
>Fix:
Yes, please!