Re: kern/56979: fork(2) fails to be signal safe

To: lib-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,tgl%sss.pgh.pa.us@localhost
Subject: Re: kern/56979: fork(2) fails to be signal safe
From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
Date: Sun, 16 Oct 2022 01:20:01 +0000 (UTC)

The following reply was made to PR lib/56979; it has been noted by GNATS.

From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/56979: fork(2) fails to be signal safe
Date: Sat, 15 Oct 2022 21:17:37 -0400

 Taylor R Campbell <riastradh%NetBSD.org@localhost> writes:
 >> From: Tom Lane <tgl%sss.pgh.pa.us@localhost>
 >> Didn't take long to find out that there's still a problem.  With
 >> this patch, it gets past the fork() all right, but there's still
 >> a risk of the child process getting stuck on the RTLD lock later:

 > Do I understand correctly that this means you're trying to call dlopen
 > from a signal handler?

 Well, it *was* a signal handler, but once it issues fork() I wouldn't
 personally regard it as a signal handler anymore.  The child process
 is certainly never going to return control to the interrupted code.
 The parent process (the Postgres "postmaster") runs with signals blocked
 everywhere except this one select() call in its wait loop, so it's safer
 than it sounds.  The postmaster has been coded like that since the
 nineties, and AFAIR this is the first bit of trouble we've had with it.

 > I didn't follow exactly what you're doing, but I suspect it would be
 > much more reliable to have the signal handler set a flag or write a
 > flag to a pipe and cause select(2) to fail with EINTR and process the
 > flag

 People have been talking about changing the postmaster to not do anything
 interesting in signal handlers since the nineties, too, and it's not
 gotten done yet.  The effort-to-reward ratio is just not very good.
 It might happen sometime, but I'm not holding my breath.

 >> (BTW, is the RTLD lock business new in v10?  I'm surprised that
 >> we've not heard field reports of Postgres getting stuck at startup
 >> on NetBSD.)

 > Yes.

 OK, thanks for confirming that.  What we've done about this for the
 moment is to force linking with -Wl,-z,now on NetBSD, which fixes
 this particular problem --- at least, we've not seen it since then
 on two different NetBSD test machines that previously did exhibit
 the failure intermittently --- and it seems like generally a good
 idea anyway.

 			regards, tom lane
