NetBSD-Bugs archive
Re: port-vax/55415: vax no longer preempts in a timely fashion
The following reply was made to PR port-vax/55415; it has been noted by GNATS.
From: Greg Oster <oster%netbsd.org@localhost>
To: Anders Magnusson <ragge%tethuvudet.se@localhost>, gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Fri, 1 Jul 2022 17:25:06 -0600
An update on this....
If I add a separate check for cpu_intr_p() into
src/sys/kern/kern_runq.c:sched_resched_cpu() like this:
...
	for (o = 0;; o = n) {
		n = atomic_cas_uint(&ci->ci_want_resched, o, o | f);
		if (__predict_true(o == n)) {
			/*
			 * We're the first to set a resched on the CPU.  Try
			 * to avoid causing a needless trip through trap()
			 * to handle an AST fault, if it's known the LWP
			 * will either block or go through userret() soon.
			 */
			if (l != curlwp || cpu_intr_p()) {
				cpu_need_resched(ci, l, f);
			}
			break;
		}
		/* NEW CODE */
		if (cpu_intr_p()) {
			cpu_need_resched(ci, l, f);
			break;
		}
		/* END OF NEW CODE */
		if (__predict_true(
		    (n & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)) >=
		    (f & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)))) {
			/* Already in progress, nothing to do. */
...
then the ping times drop from 9 seconds to 146ms, which is more in line
with what a 9.99.10 kernel does on the same hardware. The new code
doesn't fire very often, but when it does, it would have been precisely
when the 'ping stalls' would occur.
What I observed in testing that led me here is that during the high
ping times the kernel is stuck looping in the "Already in progress,
nothing to do." section when in fact, there are interrupts(?) coming in
that need servicing.... Is this maybe something to do with VAX having
hardware ASTs?
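A rough user-space model of the behaviour described above, as a sketch
only and not kernel code. Everything prefixed sim_ is hypothetical, the
flag values are made up, and the "lost AST" is simulated by simply
clearing a boolean. It shows why the unpatched loop never re-posts once
ci_want_resched is already set, and how the extra cpu_intr_p() check
changes that:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel's real values may differ. */
#define SIM_UPREEMPT	0x01u
#define SIM_KPREEMPT	0x02u
#define SIM_MASK	(SIM_KPREEMPT | SIM_UPREEMPT)

struct sim_cpu {
	atomic_uint want_resched;	/* models ci->ci_want_resched */
	bool ast_pending;		/* models the posted AST */
	bool in_intr;			/* models cpu_intr_p() */
};

/* Models cpu_need_resched(): post the AST. */
static void
sim_need_resched(struct sim_cpu *ci)
{
	ci->ast_pending = true;
}

/* Compare-and-swap that returns the old value, like atomic_cas_uint(). */
static unsigned
sim_cas(atomic_uint *p, unsigned expected, unsigned desired)
{
	atomic_compare_exchange_strong(p, &expected, desired);
	return expected;
}

/* The loop from the mail; repost_in_intr enables the "NEW CODE" branch. */
static void
sim_resched(struct sim_cpu *ci, unsigned f, bool is_curlwp,
    bool repost_in_intr)
{
	unsigned o, n;

	assert(f != 0);
	for (o = 0;; o = n) {
		n = sim_cas(&ci->want_resched, o, o | f);
		if (o == n) {
			/* First to set a resched on this CPU. */
			if (!is_curlwp || ci->in_intr)
				sim_need_resched(ci);
			break;
		}
		if (repost_in_intr && ci->in_intr) {
			sim_need_resched(ci);
			break;
		}
		if ((n & SIM_MASK) >= (f & SIM_MASK))
			break;	/* Already in progress, nothing to do. */
	}
}

/* Post an AST, "lose" it, request a resched again; is it pending now? */
static bool
sim_lost_ast_recovers(bool repost_in_intr)
{
	struct sim_cpu ci;

	atomic_init(&ci.want_resched, 0);
	ci.ast_pending = false;
	ci.in_intr = true;

	sim_resched(&ci, SIM_UPREEMPT, true, repost_in_intr);
	ci.ast_pending = false;		/* simulate the AST going missing */
	sim_resched(&ci, SIM_UPREEMPT, true, repost_in_intr);
	return ci.ast_pending;
}
```

In this model sim_lost_ast_recovers(false) leaves the AST unposted,
because the second request only sees the flag already set, while
sim_lost_ast_recovers(true) posts it again. Whether re-posting is the
right fix, rather than not losing the AST in the first place, is the
open question in this mail.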
These tests are also with a patch from Ragge to preserve ASTs in a PCB
across context switching.
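For reference, the idea behind preserving the AST in the PCB can be
modelled as below. This is not Ragge's patch, only a sketch under
assumptions: the sim_ names and the two level values are hypothetical,
and the model assumes that restoring a context reloads the AST level
from the PCB while saving a context does not write it back unless
software does so explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for "no AST requested" and "AST requested". */
#define SIM_AST_NO	4
#define SIM_AST_OK	3

struct sim_pcb { int astlvl; };		/* per-process saved AST level */
struct sim_hw  { int astlvl_reg; };	/* models the ASTLVL register */

/* Save context; the AST level is only copied out if preserve_ast is set. */
static void
sim_save_ctx(const struct sim_hw *hw, struct sim_pcb *pcb, bool preserve_ast)
{
	assert(hw && pcb);
	if (preserve_ast)
		pcb->astlvl = hw->astlvl_reg;
}

/* Load context; the register takes whatever the PCB holds. */
static void
sim_load_ctx(struct sim_hw *hw, const struct sim_pcb *pcb)
{
	assert(hw && pcb);
	hw->astlvl_reg = pcb->astlvl;
}

/* Post an AST for A, switch to B and back; is the AST still requested? */
static bool
sim_ast_survives_switch(bool preserve_ast)
{
	struct sim_hw hw = { SIM_AST_NO };
	struct sim_pcb a = { SIM_AST_NO }, b = { SIM_AST_NO };

	hw.astlvl_reg = SIM_AST_OK;		/* AST posted while A runs */
	sim_save_ctx(&hw, &a, preserve_ast);	/* switch away from A */
	sim_load_ctx(&hw, &b);
	sim_save_ctx(&hw, &b, preserve_ast);	/* switch back to A */
	sim_load_ctx(&hw, &a);
	return hw.astlvl_reg == SIM_AST_OK;
}
```

Without the explicit save the request is gone after the round trip;
with it, the request is still there when A resumes.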
I suspect such a 'fix' wouldn't be appropriate for all the other
architectures, but I don't know that for sure. Perhaps there's
something machine-dependent with VAX that doesn't fit into the current
machine-independent way of doing things? Or is there still some other
bit missing in the VAX code that would accomplish the above? What does
seem to be true is that on a VAX, the "nothing to do" isn't sufficient
for a performant (if we can call VAX that :) ) system.
Later...
Greg Oster
On 2020-07-30 14:49, oster%netbsd.org@localhost wrote:
> On 7/30/20 1:38 PM, Anders Magnusson wrote:
>>
>>
>> Den 2020-07-30 kl. 21:30, skrev oster%netbsd.org@localhost:
>>> The following reply was made to PR port-vax/55415; it has been noted
>>> by GNATS.
>>>
>>> From: oster%netbsd.org@localhost
>>> To: gnats-bugs%netbsd.org@localhost, port-vax-maintainer%netbsd.org@localhost,
>>>  gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost, oster%netbsd.org@localhost
>>> Cc:
>>> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
>>> Date: Thu, 30 Jul 2020 13:25:48 -0600
>>>
>>>  On 7/30/20 1:10 PM, Anders Magnusson wrote:
>>>  > The following reply was made to PR port-vax/55415; it has been
>>>  > noted by GNATS.
>>>  >
>>>  > From: Anders Magnusson <ragge%tethuvudet.se@localhost>
>>>  > To: gnats-bugs%netbsd.org@localhost, oster%netbsd.org@localhost
>>>  > Cc:
>>>  > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
>>>  > Date: Thu, 30 Jul 2020 21:07:37 +0200
>>>  >
>>>  >  >  I've done a bit more debugging...  What I'm seeing is that in
>>>  >  >  kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
>>>  >  >  happens, cpu_need_resched() sets up the AST.  Except it's only once in a
>>>  >  >  while that the trap with the AST fires, userret() gets called, and
>>>  >  >  preemption happens!  Sometimes the trap with AST fires once, and not
>>>  >  >  again... sometimes it fires 5 times in a row, and then misses.... but I
>>>  >  >  don't know why an AST that has been posted would subsequently get missed
>>>  >  >  sometimes....
>>>  >  >
>>>  >  >  So it's able to hit a situation where cpu_need_resched() is called, but
>>>  >  >  the corresponding AST never fires.  The loop in sched_resched_cpu() that
>>>  >  >  sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
>>>  >  >  already been setup, and so doesn't try to call cpu_need_resched() again.
>>>  >  >  When it gets 'stuck' like this, we never see an AST until the process
>>>  >  >  completes. (nor do we see preemption until the process completes.)
>>>  >  >  That seems to be because if I check the AST status with:
>>>  >  >
>>>  >  >      if (mfpr(PR_ASTLVL) != AST_OK)
>>>  >  >
>>>  >  >  that condition is always true... (meaning the AST is not setup...)
>>>  >  >
>>>  >  >  Any ideas on how an AST can just 'disappear'?  (I'm using the same
>>>  >  >  mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
>>>  >  >  thinks it's set just fine... so how does it go missing a few moments
>>>  >  >  after????)
>>>  >  >
>>>  >  The AST is only acked if it has been taken.  This is done in trap(),
>>>  >  just before userret() is called.
>>>  >  Losing the AST should not be possible.
>>>  >
>>>  >  Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
>>>  >  process switch occurs before the AST is delivered it will be lost.
>>>  >  Can this ever happen?
>>>  Hmm... svpctx happens in softint_common(), which seems to be called from
>>>  lots of softFOO functions...  So if I'm reading this correctly, if we
>>>  happen to get into softint_common then the AST will get lost....
>> AST is itself a softint (called at Softint level 2).
>> But we should probably add saving of AST levels in the PCB anyway.
>
> I'm happy to test :)
>
> Later...
>
> Greg Oster