NetBSD-Users archive
Re: FreeRADIUS instability
On 9/29/21 09:09, Pawel S. Veselov wrote:
Yes, the question is what happened to fd#3 (presumably the kqueue).
Can you get into the debugger (gdb <radiusd> <pid>), look at the
queue call, and see what fd is passed to it?
It's still fd#3
What we have determined from tracing the process is that fd#3 is
closed and then re-opened as another kqueue (due to fd reuse).
radiusd then keeps using it as its own, but since none of its
filters are registered there, the process is effectively dead.
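
The fd reuse mentioned above follows from POSIX handing out the lowest
free descriptor number. A minimal sketch, assuming a single-threaded
process; the function name fd_number_is_reused() is made up for
illustration, and /dev/null stands in for the kqueue:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if the number freed by close() is handed out again by the
 * next open(), 0 if not, -1 on error. Assumes no other thread opens or
 * closes descriptors in between.
 */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);                       /* the number is now free */

    int second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;

    int reused = (second == first);     /* lowest free number comes back */
    close(second);
    return reused;
}
```

Any code that still remembers the old number after the close() will
silently operate on whatever object was opened next.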
So we caught where the kqueue is closed, and traced it back to
getaddrinfo(). That call closes fd#3, then creates a new kqueue
and leaves it open. This is the backtrace from close():
#0 0x0000732d69c07892 in close () from /usr/lib/libpthread.so.1
#1 0x0000732d68f25da9 in __res_ndestroy () from /usr/lib/libc.so.12
#2 0x0000732d68f2676b in __res_vinit () from /usr/lib/libc.so.12
#3 0x0000732d68f26bef in __res_check () from /usr/lib/libc.so.12
#4 0x0000732d68f22220 in __res_nsend () from /usr/lib/libc.so.12
#5 0x0000732d68f2719c in ?? () from /usr/lib/libc.so.12
#6 0x0000732d68f27420 in ?? () from /usr/lib/libc.so.12
#7 0x0000732d68f2a5a9 in ?? () from /usr/lib/libc.so.12
#8 0x0000732d68f2a8bd in ?? () from /usr/lib/libc.so.12
#9 0x0000732d68f3ed49 in nsdispatch () from /usr/lib/libc.so.12
#10 0x0000732d68f286c8 in getaddrinfo () from /usr/lib/libc.so.12
The full stack traces and ktraces can be found here:
https://github.com/FreeRADIUS/freeradius-server/issues/4244
I have an idea of what's going on. AFAIU, libc maintains a kqueue
for issuing DNS requests. kqueues are not inherited on fork.
If the parent process calls getaddrinfo(), that creates an internal
DNS kqueue in its address space, assigned to a FD (let's say 3).
After fork(), FD 3 is unused in the child process. Let's say the
child immediately opens something permanent, which is assigned
FD 3.
Then the child calls getaddrinfo(). The internal state of the
resolver still has the statp object that references FD 3 (I don't
believe it's cleaned up after fork), but FD 3 is now used by the
application, and the obvious collision occurs.
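
The sequence above can be approximated with ordinary descriptors. This
is only a sketch under stated assumptions, not the libc code: cached_fd
and lib_reinit() are hypothetical stand-ins for the resolver state and
its re-initialisation, /dev/null stands in for the kqueue, and the
child closes the descriptor by hand to simulate the kqueue not being
inherited.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the descriptor kept in the resolver state. */
static int cached_fd = -1;

/* Stand-in for the resolver re-initialising: close old fd, open a new one. */
static void lib_reinit(void)
{
    if (cached_fd >= 0)
        close(cached_fd);           /* may hit a descriptor we no longer own */
    cached_fd = open("/dev/null", O_RDONLY);
}

/*
 * Returns 1 if the child's own descriptor was silently replaced,
 * 0 if it survived, -1 on error.
 */
int child_fd_is_hijacked(void)
{
    int status;
    pid_t pid;

    lib_reinit();                   /* parent: library grabs a descriptor */
    if (cached_fd < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct stat st;
        int app_fd;

        /* Simulate non-inheritance: the number is free in the child,
         * but cached_fd still names it. */
        close(cached_fd);
        app_fd = open("/", O_RDONLY);   /* application reuses the number */
        lib_reinit();                   /* stale state closes and reopens it */

        if (app_fd < 0 || fstat(app_fd, &st) != 0)
            _exit(2);
        /* Exit 1 when app_fd no longer refers to the directory. */
        _exit(S_ISDIR(st.st_mode) ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

In the child, the application's descriptor number stays valid but now
refers to the object the library opened, which matches the symptom of
a kqueue with none of the application's filters on it.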
From ktrace:
Parent:
(getaddrinfo or such)
28913 1 radiusd 1632165412.994373444 CALL kqueue1(0x400000)
28913 1 radiusd 1632165412.994374612 RET kqueue1 3
... parent never closes 3
(fork)
28913 1 radiusd 1632165413.001116635 CALL fork
28913 1 radiusd 1632165413.001356463 RET fork 16226/0x3f62
(child creates its own kqueue)
16226 1 radiusd 1632165413.002185215 CALL kqueue
16226 1 radiusd 1632165413.002186171 RET kqueue 3
(child calls getaddrinfo, telltale is reading /etc/hosts)
16226 1 radiusd 1632397379.465012449 GIO fd 15 read 731 bytes
"# $NetBSD: hosts,v 1.9 2013/11/24 07:20:01 dholland Exp
16226 1 radiusd 1632397379.465033818 CALL __gettimeofday50(0x7f7fff62e700,0)
(resolver uses FD 3 as its own, reading from it and closing it)
16226 1 radiusd 1632397379.465034253 RET __gettimeofday50 0
16226 1 radiusd 1632397379.465036295 CALL __kevent50(3,0,0,0x7f7fff62e110,1,0x74fd7b7787b0)
16226 1 radiusd 1632397379.465037310 RET __kevent50 1
16226 1 radiusd 1632397379.465043316 CALL close(3)
I think the only way to fix this is to have the resolver state
cleaned up thoroughly after fork(). I can't see how this can be
worked around by applications.
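
To illustrate what cleaning up library state after fork() could look
like, here is a rough sketch using pthread_atfork(). It is not the
NetBSD resolver code and not a proposed patch: cached_fd, lib_reinit()
and drop_state_in_child() are hypothetical names, /dev/null stands in
for the kqueue, and the handler registration is not thread-safe.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static int cached_fd = -1;          /* hypothetical library-private fd */
static int handler_registered = 0;  /* sketch only: not thread-safe */

/* Runs in the child right after fork(), before application code. */
static void drop_state_in_child(void)
{
    /* /dev/null is inherited, so it is closed here. For a kqueue, which
     * the child does not inherit, forgetting the number would suffice. */
    if (cached_fd >= 0)
        close(cached_fd);
    cached_fd = -1;
}

static void lib_reinit(void)
{
    if (!handler_registered) {
        pthread_atfork(NULL, NULL, drop_state_in_child);
        handler_registered = 1;
    }
    if (cached_fd >= 0)
        close(cached_fd);
    cached_fd = open("/dev/null", O_RDONLY);
}

/* Returns 1 if the child's own descriptor survives, 0 if not, -1 on error. */
int child_fd_survives(void)
{
    int status;
    pid_t pid;

    lib_reinit();                   /* parent: library grabs a descriptor */
    if (cached_fd < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct stat st;
        int app_fd = open("/", O_RDONLY);   /* may reuse the freed number */

        lib_reinit();               /* cached_fd is -1: nothing stale to close */
        if (app_fd < 0 || fstat(app_fd, &st) != 0)
            _exit(2);
        _exit(S_ISDIR(st.st_mode) ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The point of the sketch is that the reset has to happen inside the
library that owns the cached descriptor; the application has no handle
on that state.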
Thank you.