NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/58694: race condition when ptracing threaded applications



>Number:         58694
>Category:       kern
>Synopsis:       race condition when ptracing threaded applications
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Sep 25 09:55:00 +0000 2024
>Originator:     Martin Husemann
>Release:        NetBSD 10.99.12
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD big-apple.aprisoft.de 10.99.12 NetBSD 10.99.12 (POWERMAC_G5.MP) #119: Tue Sep 24 16:48:52 CEST 2024 martin%seven-days-to-the-wolves.aprisoft.de@localhost:/work/src/sys/arch/macppc/compile/POWERMAC_G5.MP macppc
Architecture: powerpc
Machine: macppc
>Description:

I am not sure if this should be a kern bug for the ptrace interface or a
toolchain bug for gdb issues around it - all existing ptrace ATF tests
pass fine, and AFAICT also include a threaded variant where new threads
exit - so this must be special somehow.

It also may be related to PR 58691.

I run a browser under gdb and while it is still starting up gdb gets confused
about threads. Note that LWP 1090 seems to come and go in the output below,
this is strange as pid/lid recycling shouldn't happen this fast and the machine
was otherwise mostly idle. So maybe there is some hickup in ptrace event 
reporting or gdb gets confused on the userland side - I am open to ideas
what to investigate next.

To get this output I added a DPRINTF to powerpc process_machdep.c that
would have explained EBUSY but I never saw that fire, and I added
options PTRACE_DEBUG to my kernel config. I also slightly modified the
DPRINTF that creates the last kernel output line below to get more information
about the lwp being passed as "l" into that code (where previously it only
printed the "t" target details pid, stat and flags = !t->p_waited.

[ 9539.8398799] ptrace_startstop: lwp=2041 request=21
[ 9539.8398799] ptrace_startstop: lwp=2850 request=21
[ 9540.0598920] ptrace_get_siginfo: lwp=1090 signal=5
[ 9540.0598920] ptrace_get_process_state: lwp=1090 event=0x8 pid=1090 lwp=1090
[New LWP 1090 of process 2020]
[ 9540.0598920] ptrace_regs: lwp=1596 request=33
[ 9540.0598920] ptrace_startstop: lwp=2020 request=21
[ 9540.0598920] ptrace_startstop: lwp=2020 request=21
[ 9540.0598920] ptrace_startstop: lwp=2668 request=21
[ 9540.0598920] ptrace_startstop: lwp=2621 request=21
[ 9540.0598920] ptrace_startstop: lwp=3048 request=21
[ 9540.0598920] ptrace_startstop: lwp=2388 request=21
[ 9540.0598920] ptrace_startstop: lwp=2034 request=21
[ 9540.0598920] ptrace_startstop: lwp=2040 request=21
[ 9540.0598920] ptrace_startstop: lwp=1780 request=21
[ 9540.0598920] ptrace_startstop: lwp=2537 request=21
[ 9540.0598920] ptrace_startstop: lwp=2648 request=21
[ 9540.0598920] ptrace_startstop: lwp=2042 request=21
[ 9540.0598920] ptrace_startstop: lwp=1329 request=21
[ 9540.0598920] ptrace_startstop: lwp=3050 request=21
[ 9540.0598920] ptrace_startstop: lwp=2266 request=21
[ 9540.0598920] ptrace_startstop: lwp=1336 request=21
[ 9540.0598920] ptrace_startstop: lwp=1248 request=21
[ 9540.0598920] ptrace_startstop: lwp=1371 request=21
[ 9540.0598920] ptrace_startstop: lwp=1576 request=21
[ 9540.0598920] ptrace_startstop: lwp=2264 request=21
[ 9540.0598920] ptrace_startstop: lwp=1829 request=21
[ 9540.0598920] ptrace_startstop: lwp=2045 request=21
[ 9540.0598920] ptrace_startstop: lwp=1613 request=21
[ 9540.0598920] ptrace_startstop: lwp=2041 request=21
[ 9540.0598920] ptrace_startstop: lwp=2850 request=21
[ 9540.0598920] ptrace_startstop: lwp=1090 request=21
[ 9540.0798926] ptrace_get_siginfo: lwp=1090 signal=5
[ 9540.0798926] ptrace_get_process_state: lwp=1090 event=0x10 pid=1090 lwp=1090
[LWP 1090 of process 2020 exited]
[New LWP 1090 of process 2020]
[LWP 1090 of process 2020 exited]
[ 9540.0798926] req 33 pid 2020 (l->l_proc->p_pid 1596 lid 1596) stat 2 flag 0 -> EBUSY
Couldn't get registers: Device busy.


I am confused by LWP 1090 being noted as new and exited and new and exited
by gdb, and then PT_GETREGS (req 33) failing for some lwp that I did not
print out (will try to add that).

The userland state after the failure is:

[~] martin@big-apple > ps -sp 2020
UID  PID PPID    CPU  LID NLWP PRI NI    VSZ   RSS WCHAN  STAT TTY     LTIME COMMAND
205 2020 1596 194389 1736   25  30  0 403004 82800 -      T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2850   25  37  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2041   25  37  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1613   25  37  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2045   25  37  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1829   25  37  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2264   25  37  0 403004 82800 parked T    pts/2 0:00.04 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1576   25  38  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1371   25  38  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1248   25  40  0 403004 82800 parked T    pts/2 0:00.02 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1336   25  40  0 403004 82800 parked T    pts/2 0:00.01 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2266   25  43  0 403004 82800 -      T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 3050   25  42  0 403004 82800 parked T    pts/2 0:00.03 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1329   25  43  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2042   25  42  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2648   25  43  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2537   25  43  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 1780   25  43  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2034   25  43  0 403004 82800 parked T    pts/2 0:00.01 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2040   25  43  0 403004 82800 parked T    pts/2 0:00.14 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2388   25  43  0 403004 82800 parked T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 3048   25  43  0 403004 82800 -      T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2621   25  85  0 403004 82800 poll   T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2668   25  85  0 403004 82800 kqueue T    pts/2 0:00.00 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 
205 2020 1596 194389 2020   25  30  0 403004 82800 parked T    pts/2 0:02.36 /pkgbuild/usr/pkgobj/www/palemoon/work/build/dist/bin/palemoon 


>How-To-Repeat:

(on my machine > 70% reproducable)

	cd $(pkgsrc www/palemoon work dir)
	cd build/dist/bin
	./run-mozilla.sh -g ./palemoon


>Fix:
n/a



Home | Main Index | Thread Index | Old Index