tech-net archive
Re: RFC: softint-based if_input
On Thu, Jan 28, 2016 at 12:17 AM, Taylor R Campbell
<campbell+netbsd-tech-kern%mumble.net@localhost> wrote:
> Date: Wed, 27 Jan 2016 16:51:22 +0900
> From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
>
> Here it is: http://www.netbsd.org/~ozaki-r/softint-if_input-ifqueue.diff
>
> Results of performance measurements of it are also added to
> https://gist.github.com/ozaki-r/975b06216a54a084debc
>
> The results are good but bother me: the new implementation achieves
> better performance than vanilla (and the 1st implementation) under
> high load (IP forwarding). For fast forwarding, it also beats the
> 1st one.
>
> I thought that holding splnet during ifp->if_input (splnet is needed
> for the ifqueue operations, so the patch keeps holding it) might
> affect the results. So I tried releasing it during ifp->if_input,
> but the results didn't change much (the IP forwarding result is
> still better than vanilla).
>
> Anyone have any ideas?
>
> Here's a wild guess: with vanilla, each CPU does
>
> wm_rxeof loop iteration
> if_input processing
> wm_rxeof loop iteration
> if_input processing
> ...
>
> back and forth. With softint-rx-ifq, each CPU does
>
> wm_rxeof loop iteration
> wm_rxeof loop iteration
> ...
> if_input processing
> if_input processing
> ...
>
> because softint processing is blocked until the hardintr handler
> completes. So vanilla might make less efficient use of the CPU cache,
> and vanilla might leave the rxq full for longer so that the device
> cannot fill it as quickly with incoming packets.
That might be true. If so, the real question may be why the old
implementation is less efficient than the new one.
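
For concreteness, the shape under discussion is roughly the following
(a minimal sketch, not the actual diff; the xx_ names, the softc
fields, and the xx_rxeof helper are made up). The splnet section
covers only the ifqueue operation, i.e. this is the "release during
ifp->if_input" variant:

#include <sys/param.h>
#include <sys/intr.h>
#include <sys/mbuf.h>
#include <net/if.h>

struct xx_softc {
	struct ifnet	*sc_ifp;	/* made-up softc layout */
	struct ifqueue	 sc_rxq;	/* hardintr -> softint handoff */
	void		*sc_rx_sih;	/* softint_establish() cookie */
};

/*
 * attach: sc->sc_rxq.ifq_maxlen = IFQ_MAXLEN;
 *         sc->sc_rx_sih = softint_establish(SOFTINT_NET,
 *             xx_rxsoftint, sc);
 */

static struct mbuf *xx_rxeof(struct xx_softc *);  /* made-up ring helper */

static void
xx_rxintr(struct xx_softc *sc)		/* hard interrupt context */
{
	struct mbuf *m;

	/* drain the whole HW ring first... */
	while ((m = xx_rxeof(sc)) != NULL) {
		if (IF_QFULL(&sc->sc_rxq)) {
			IF_DROP(&sc->sc_rxq);
			m_freem(m);
		} else
			IF_ENQUEUE(&sc->sc_rxq, m);
	}
	/* ...if_input runs later, in batch, after the hardintr returns */
	softint_schedule(sc->sc_rx_sih);
}

static void
xx_rxsoftint(void *arg)			/* softint context */
{
	struct xx_softc *sc = arg;
	struct mbuf *m;
	int s;

	for (;;) {
		s = splnet();		/* ifqueue ops need splnet */
		IF_DEQUEUE(&sc->sc_rxq, m);
		splx(s);		/* dropped across if_input */
		if (m == NULL)
			break;
		(*sc->sc_ifp->if_input)(sc->sc_ifp, m);
	}
}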
>
> Another experiment that might be worthwhile is to bind the interrupt
> to a specific CPU, and then use splnet instead of WM_RX_LOCK to avoid
> acquiring and releasing a lock for each packet.
In the measurements, all interrupts are already delivered to CPU#0,
and removing the lock doesn't change the results. I guess acquiring
and releasing an uncontended lock is low overhead. Note that wm has
an RX lock per HW queue, so RX processing basically runs without
lock contention.
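
(Roughly the locking shape meant here, with made-up names rather than
the real wm(4) layout: each HW queue synchronizes independently, so
the uncontended fast path is a cheap atomic pair on a cache line that
stays local when a single CPU handles the queue.)

#include <sys/mutex.h>

struct xx_rxqueue {
	kmutex_t	rxq_lock;	/* per-HW-queue lock, a la WM_RX_LOCK() */
	/* per-queue descriptor ring, ifqueue, ... */
};

/* init: mutex_init(&rxq->rxq_lock, MUTEX_DEFAULT, IPL_NET); */

static void
xx_rxeof_locked(struct xx_rxqueue *rxq)
{
	mutex_enter(&rxq->rxq_lock);	/* uncontended: cheap atomic */
	/* drain this queue's ring; other queues proceed in parallel */
	mutex_exit(&rxq->rxq_lock);
}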
> (On Intel >=Haswell,
> we should use transactional memory to avoid bus traffic for that
> anyway (and maybe invent an MD pcq(9) that does the same). But the
> experiment with wm(4) is easier, and not everyone has transactional
> memory.)
How does transactional memory help?
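
For reference, a lock-free version of the same handoff is already
expressible with the MI pcq(9) API; a sketch reusing the made-up
names from above, assuming a pcq_t *sc_rxpcq field created at attach
time with pcq_create(256, KM_SLEEP):

#include <sys/pcq.h>

static void
xx_rxintr_pcq(struct xx_softc *sc)	/* hard interrupt context */
{
	struct mbuf *m;

	while ((m = xx_rxeof(sc)) != NULL) {
		/* lock-free put: no splnet, no mutex */
		if (!pcq_put(sc->sc_rxpcq, m))
			m_freem(m);	/* queue full: drop */
	}
	softint_schedule(sc->sc_rx_sih);
}

static void
xx_rxsoftint_pcq(void *arg)		/* softint context */
{
	struct xx_softc *sc = arg;
	struct mbuf *m;

	/* lock-free get; nothing held across if_input */
	while ((m = pcq_get(sc->sc_rxpcq)) != NULL)
		(*sc->sc_ifp->if_input)(sc->sc_ifp, m);
}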
ozaki-r