On Sat, 10 Oct 2015, Taylor R Campbell wrote:
>   Date: Sat, 10 Oct 2015 16:50:42 +0800 (PHT)
>   From: Paul Goyette <paul%vps1.whooppee.com@localhost>
>
>   While continuing to track down the zombie-that-would-not-die I managed
>   to find two more places where a process's p_stat and its parent's count
>   of children to wait for (p_nstopchild) get out of sync. The additional
>   issues are documented in PRs kern/50308 and kern/50318.
>
>   With fixes for all four of these PRs in my local kernel, the zombie
>   problem seems to have disappeared, and no other ill effects have been
>   seen. I have confirmed that at least kern/50300 was being seen in my
>   local system, and correlated with the appearance of the long-lived
>   zombie; kern/50298 and kern/50308 have not been specifically observed.
>
> Based on the analysis I just sent to PR kern/50318 (not noticing
> until I was done that it applied to all four of them), the four
> patches look good to me. Please commit them separately, with a brief
> analysis and PR reference in each one, so we have a chance of
> bisection if anything goes wrong.
Thanks for looking, and for providing the formal analysis. kre and I
had done pretty much the same investigation, albeit less formally.
I'll let the patches run for a while in my local code before I commit (and
request pull-ups to NetBSD-7).
> We also ought to add automatic tests for proc.12.stop{exec,exit,fork},
> since the code for them looks fishy and is likely seldom exercised.
Yeah. I'll try to figure out how to test this stuff. You're right,
these code paths appear to be rarely exercised.
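
As a possible starting point, here is a minimal userland sketch of the
wait-side skeleton such a test might build on. It is only an assumption
about the shape of a test, not the test itself: it does not set any of
the stop{exec,exit,fork} knobs, and the function name
stop_and_reap_child is made up for illustration. It forks a child that
stops itself, checks that the parent sees the stop, kills the child,
and checks that the child is reaped rather than lingering as a zombie.

```c
#include <sys/types.h>
#include <sys/wait.h>

#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any existing test suite): fork a child
 * that stops itself, observe the stop, kill the child, and reap it.
 * Returns 0 if every step behaved as expected, -1 otherwise.
 */
int
stop_and_reap_child(void)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: stop ourselves; we expect to be killed while stopped. */
		raise(SIGSTOP);
		_exit(0);
	}

	/* The stop must be reported to the parent. */
	if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
		return -1;

	/* SIGKILL takes effect even on a stopped process. */
	if (kill(pid, SIGKILL) == -1)
		return -1;
	if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
		return -1;

	/* The child has been reaped, so waiting for it again must fail. */
	if (waitpid(pid, &status, WNOHANG) != -1 || errno != ECHILD)
		return -1;

	return 0;
}
```

A real test would presumably enable one of the stop knobs on the child
before the fork/exec/exit of interest and then run the same checks.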
+------------------+--------------------------+-------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+-------------------------+