Current-Users archive
Re: 10.99.7 panic: defibrillate
> Date: Mon, 14 Aug 2023 18:16:49 +0200
> From: Thomas Klausner <wiz%NetBSD.org@localhost>
>
> On Mon, Aug 14, 2023 at 12:41:06PM +0200, Thomas Klausner wrote:
> > I had followed your suggestion and bumped the heartbeat limit from 15
> > to 300, but today it panicked again.
> >
> > panic: cpu8: found cpu9 heart stopped beating and unresponsive
> >
> > I have a core dump in case you want any particular details.
> >
> > I've now set it to 0.
>
> and had a hard hang less than half a day later.
>
> This hasn't been happening in 10.99.5 (at least not with that
> frequency), which had uptimes of weeks, so either the heartbeat code
> introduced additional problems (even if disabled this way) or
> something else got worse, or I am really really unlucky right now.
Welp.

I don't think simply having the heartbeat(9) code around would cause a
hang -- it's new code, which is higher-risk, but the design of the
code is very low-risk (all loops are bounded; interrupt handler and
soft interrupt handler are short and easy to audit for bounded
latency; each CPU only writes to its own per-CPU state). I think it's
more likely something else changed.
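
To illustrate the design properties listed above, here is a toy model
in Python. It is not the NetBSD heartbeat(9) implementation: the names
Cpu, tick, and simulate are hypothetical, time is counted in abstract
ticks rather than seconds, and the "unresponsive" probe from the panic
message is left out. It only shows that each CPU writes its own state
and reads its neighbour's, that there are no unbounded loops in the
check, and that a limit of 0 disables detection.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    index: int
    beats: int = 0        # written only by this CPU
    last_seen: int = 0    # this CPU's last reading of its neighbour's beats
    stale_ticks: int = 0  # ticks since the neighbour's count last advanced


def tick(cpus, me, limit):
    """One timer tick on CPU `me`.

    Returns the index of the neighbour if its heart has stopped for more
    than `limit` ticks, otherwise None.  A limit of 0 disables the check.
    No loops: the work per tick is constant.
    """
    cpu = cpus[me]
    cpu.beats += 1                       # only write to our own state
    neighbour = cpus[(me + 1) % len(cpus)]
    if neighbour.beats != cpu.last_seen:  # neighbour made progress
        cpu.last_seen = neighbour.beats
        cpu.stale_ticks = 0
        return None
    cpu.stale_ticks += 1
    if limit > 0 and cpu.stale_ticks > limit:
        return neighbour.index
    return None


def simulate(n_cpus, limit, dead, rounds):
    """Run `rounds` ticks on every live CPU; CPUs in `dead` never tick.

    Returns (detector, victim) for the first detection, or None.
    """
    cpus = [Cpu(i) for i in range(n_cpus)]
    for _ in range(rounds):
        for me in range(n_cpus):
            if me in dead:
                continue
            victim = tick(cpus, me, limit)
            if victim is not None:
                return (me, victim)
    return None


if __name__ == "__main__":
    # cpu9 stops beating; its predecessor cpu8 is the one that notices.
    print(simulate(n_cpus=10, limit=15, dead={9}, rounds=100))
    # With the limit set to 0 the check is disabled and nothing is reported.
    print(simulate(n_cpus=10, limit=0, dead={9}, rounds=100))
```

In this model a stopped CPU can only ever be reported, never cause the
checking CPU to spin, which is the sense in which the checking code
itself is unlikely to be the source of a hang.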

Looks like it's time to bisect over the time since your last good
build, and see if you can make it a whole day without panicking?

874 commits since I bumped 10.99.5 (which was incidentally when I
introduced heartbeat(9)), so...it should only take a week or two if
the problem takes half a day to reproduce!
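
The "week or two" figure can be checked with a rough calculation. The
sketch below assumes a plain binary search over the 874 commits and a
fixed time per tested build; bisect_steps and bisect_days are
hypothetical helper names, and the one-day-per-test case is an
assumption reflecting that a good build has to survive a whole day
before it can be called good.

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of builds to test when bisecting n_commits,
    i.e. ceil(log2(n_commits)), computed with integer arithmetic."""
    return (n_commits - 1).bit_length()


def bisect_days(n_commits: int, days_per_test: float) -> float:
    """Total wall-clock days if every tested build takes days_per_test."""
    return bisect_steps(n_commits) * days_per_test


if __name__ == "__main__":
    commits = 874
    print(bisect_steps(commits))      # builds to test in the worst case
    print(bisect_days(commits, 0.5))  # every test settles in half a day
    print(bisect_days(commits, 1.0))  # every test runs for a whole day
```

About ten builds are needed, so the search takes roughly five days if
every test settles in half a day and roughly ten days if each build is
given a full day, not counting build time or inconclusive runs.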