NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/59376: wg(4) workqueue caught in a cycle of endless work



>Number:         59376
>Category:       kern
>Synopsis:       wg(4) workqueue caught in a cycle of endless work
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Apr 29 13:40:01 +0000 2025
>Originator:     Taylor R Campbell
>Release:        10.1ish
>Organization:
Work Till You Drop BSD LLC GmbH SPQRump
>Environment:
>Description:
	Found a wg(4) rump_server stuck in a loop.  On investigation in
	gdb, I discovered this:

(gdb) thread 3
[Switching to thread 3 (LWP 1145 of process 23633)]
#0  mtxenter (mtx=0x72b6e36ddf00)
    at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:177
177     /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c: No such file or directory.
(gdb) bt
#0  mtxenter (mtx=0x72b6e36ddf00)
    at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:177
#1  rumpuser_mutex_enter (mtx=0x72b6e36ddf00)
    at /home/riastradh/netbsd/10/src/lib/librumpuser/rumpuser_pth.c:203
#2  0x000072b6e000e20b in wg_peer_work (wk=0x72b6e2f84c38,
    cookie=<optimized out>)
    at /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c:3565
#3  0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
    list=0x72b6d3feff60, wq=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
#4  workqueue_worker (cookie=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:184
#5  0x000072b6e3cd597f in threadbouncer (arg=0x72b6e33adc40)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/threads.c:90
#6  0x000072b6e2c0c89f in pthread__create_tramp (cookie=0x72b6e3258c00)
    at /home/riastradh/netbsd/10/src/lib/libpthread/pthread.c:595
#7  0x000072b6e26930e0 in ?? () from /usr/lib/libc.so.12
#8  0x0000000000000000 in ?? ()
(gdb) fr 2
#2  0x000072b6e000e20b in wg_peer_work (wk=0x72b6e2f84c38,
    cookie=<optimized out>)
    at /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c:3565
3565    /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c: No such file or directory.
(gdb) print *wk
$1 = {wk_dummy = 0x72b6e2f84c38}
(gdb) print *(struct work_impl *)wk
$2 = {wk_entry = {sqe_next = 0x72b6e2f84c38}}
...
(gdb) fr 3
#3  0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
    list=0x72b6d3feff60, wq=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
148     /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c: No such file or directory.
(gdb) info locals
wk = <optimized out>
next = 0x72b6e2f84c38
(gdb) print *next
$6 = {wk_entry = {sqe_next = 0x72b6e2f84c38}}
(gdb) print next == next->wk_entry.sqe_next
$7 = 1

	That is, somehow, the struct work has queued into a cycle of
	endless work in the rat race.  Needless to say, this is not
	intentional; it ought to have a break for weekends, at least.

	It's conceivable, from thi ssample, that the work has simply
	been requeued by an interrupt or another thread shortly after
	it was taken off the queue, which is allowed the moment the
	workqueue's function is called.  In the case of wg(4), the
	workqueue function unblocks further requeueing under wgp_lock:

   3565 	mutex_enter(wgp->wgp_intr_lock);
   3566 	while ((tasks = wgp->wgp_tasks) != 0) {
   3567 		wgp->wgp_tasks = 0;
   3568 		mutex_exit(wgp->wgp_intr_lock);

https://nxr.netbsd.org/xref/src/sys/net/if_wg.c?r=1.71.2.5#3565

	The corresponding queueing side is:

   2668 	mutex_enter(wgp->wgp_intr_lock);
   2669 	WG_DLOG("tasks=%d, task=%d\n", wgp->wgp_tasks, task);
   2670 	if (wgp->wgp_tasks == 0)
   2671 		/*
   2672 		 * XXX If the current CPU is already loaded -- e.g., if
   2673 		 * there's already a bunch of handshakes queued up --
   2674 		 * consider tossing this over to another CPU to
   2675 		 * distribute the load.
   2676 		 */
   2677 		workqueue_enqueue(wg_wq, &wgp->wgp_work, NULL);

https://nxr.netbsd.org/xref/src/sys/net/if_wg.c?r=1.71.2.5#2668

	But I took another sample in gdb at a slightly different time
	inside workqueue_runlist, and it came out as:

(gdb) bt
#0  mutex_exit (mtx=0x72b6e42ea880)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/locks.c:195
#1  0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
    list=0x72b6d3feff60, wq=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
#2  workqueue_worker (cookie=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:184
#3  0x000072b6e3cd597f in threadbouncer (arg=0x72b6e33adc40)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/librump/rumpkern/threads.c:90
#4  0x000072b6e2c0c89f in pthread__create_tramp (cookie=0x72b6e3258c00)
    at /home/riastradh/netbsd/10/src/lib/libpthread/pthread.c:595
#5  0x000072b6e26930e0 in ?? () from /usr/lib/libc.so.12
#6  0x0000000000000000 in ?? ()

	That is, we're in the tail-call to mutex_exit in wg_peer_work
	which will return here:

    144 	for (wk = SIMPLEQ_FIRST(list); wk != NULL; wk = next) {
    145 		next = SIMPLEQ_NEXT(wk, wk_entry);
    146 		SDT_PROBE4(sdt, kernel, workqueue, entry,
    147 		    wq, wk, wq->wq_func, wq->wq_arg);
 => 148 		(*wq->wq_func)((void *)wk, wq->wq_arg);
    149 		SDT_PROBE4(sdt, kernel, workqueue, return,
    150 		    wq, wk, wq->wq_func, wq->wq_arg);
    151 	}

https://nxr.netbsd.org/xref/src/sys/kern/subr_workqueue.c?r=1.41.2.1#144

	At this point, if the work has been legitimately queued again,
	there must be some wg task associated with it.  But review of
	the wg peer shows that is not the case:

(gdb) fr 1
#1  0x000072b6e3c7a6b4 in workqueue_runlist (list=0x72b6d3feff60,
    list=0x72b6d3feff60, wq=0x72b6e36eda80)
    at /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c:148
148     /home/riastradh/netbsd/10/src/lib/librump/../../sys/rump/../kern/subr_workqueue.c: No such file or directory.
(gdb) info locals
wk = <optimized out>
next = 0x72b6e2f84c38
(gdb) print (struct wg_peer *)((char *)next - (char *)&((struct wg_peer *)0)->wgp_work)
$3 = (struct wg_peer *) 0x72b6e2f83000
(gdb) print ((struct wg_peer *)0x72b6e2f83000)->wgp_tasks
$4 = 0

	To confirm this hypothesis, I set a breakpoint on
	wg_schedule_peer_task and continued, and it didn't fire:

(gdb) break wg_schedule_peer_task
Breakpoint 1 at 0x72b6e0006a9c: file /home/riastradh/netbsd/10/src/sys/rump/net/lib/libwg/../../../../net/if_wg.c, line 2666.
(gdb) c
Continuing.

	OK, gdb just wedged and I dunno what it's up to now:

^C[ 1327882.5248464] load: 0.78  cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c23877f62c/2] 1.55u 1.15s 6% 57888k
^C[ 1327885.6808894] load: 0.96  cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c2386475da/2] 2.64u 3.15s 19% 57888k
^\[ 1327896.0450304] load: 1.12  cmd: gdb 2446 [parked parked parked parked parked parked parked parked 0x73c2386475da/1] 6.39u 9.52s 52% 57888k
[ 1327921.7293803] sorry, pid 23633 was killed: orphaned traced process
[1]   Killed                  chroot /home/riastradh/netbsd/10/obj.amd64/destdir.amd64 gdb rump_server 23633

	...aaaaaand I killed gdb and now wg(4) works again.

	Out of time for further diagnostics now.

>How-To-Repeat:

	run wg-userspace for weeks or months


>Fix:

	Yes, please!




Home | Main Index | Thread Index | Old Index