Subject: Re: Network driver receive path
To: jonathan@dsg.stanford.edu <jonathan@dsg.stanford.edu>
From: Maen Suleiman <maen.suleiman@gmail.com>
List: tech-net
Date: 03/15/2007 12:48:19
Jonathan, see below:

On 3/14/07, jonathan@dsg.stanford.edu <jonathan@dsg.stanford.edu> wrote:
>
> In message <9c1cad6e0703141024n6c8385dei3058e73a73a0e696@mail.gmail.com>,
> "Maen Suleiman" writes:
> >Hi,
> >
> >I am trying to tune our giga driver performance,
>
> Is a "giga" a gigabit Ethernet interface?

Yes

>
> >I have noticed that
> >the system spends 57% of the time on interrupts when we do a
> >receive-oriented test, while the system spends only 20% of the time
> >on interrupts when we do a send-oriented test.
> >
> >From the profiler results, we understood that most of the time spent
> >in the RX interrupt goes to MGETHDR, MCLGET and bus_dmamap_load, and
> >mainly to the bus_dmamap_load function.
>
> Are your tests sustaining the same (or closely comparable) throughput?
> If so, then your driver is DMA-mapping roughly the same amount of data
> for both transmit and receive.  Again, if so, that'd tend to suggest
> the problem is the interrupt rate on the receive side, rather than the
> transmit side.  The fix for *that* is to use interrupt mitigation, if
> you can.
>

I get 150% better performance in TX than in RX.
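
For reference, what we do per packet in the RX interrupt is roughly the
following (a simplified sketch; "gige_softc", "rxd" and friends are just
placeholder names from our driver, not any standard API):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

/* Refill one RX descriptor with a fresh cluster mbuf. */
static int
gige_rx_refill(struct gige_softc *sc, struct gige_rxdesc *rxd)
{
    struct mbuf *m;
    int error;

    MGETHDR(m, M_DONTWAIT, MT_DATA);    /* packet header mbuf */
    if (m == NULL)
        return ENOBUFS;
    MCLGET(m, M_DONTWAIT);              /* attach a cluster */
    if ((m->m_flags & M_EXT) == 0) {
        m_freem(m);
        return ENOBUFS;
    }
    m->m_len = m->m_pkthdr.len = MCLBYTES;

    /* This is the call that dominates our profile. */
    error = bus_dmamap_load(sc->sc_dmat, rxd->rxd_dmamap,
        mtod(m, void *), MCLBYTES, NULL,
        BUS_DMA_READ | BUS_DMA_NOWAIT);
    if (error != 0) {
        m_freem(m);
        return error;
    }
    bus_dmamap_sync(sc->sc_dmat, rxd->rxd_dmamap, 0, MCLBYTES,
        BUS_DMASYNC_PREREAD);

    rxd->rxd_mbuf = m;
    /* rxd_dmamap->dm_segs[0].ds_addr goes into the HW descriptor */
    return 0;
}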

> On the other hand, if you are confident in your profile data pointing
> to bus_dmamap_load, perhaps the DMA map for receive data really is
> significantly more expensive (per packet) than for TX data.  At a
> wild guess, perhaps Rx incurs more work than Tx (e.g., forcing
> lines of cached data from the CPU cache out into main memory?)
>

Usually TX involves a cache flush while RX involves invalidation, and
an invalidate should be less expensive than a flush.
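
Concretely, in terms of bus_dma(9), the two paths come down to something
like this (simplified, placeholder names; the actual flush vs. invalidate
is whatever the MD backend does underneath):

/* RX: the chip has DMA'd a frame into the cluster; invalidate the
 * cache lines covering it before the stack reads the data. */
bus_dmamap_sync(sc->sc_dmat, rxd->rxd_dmamap, 0, rx_len,
    BUS_DMASYNC_POSTREAD);

/* TX: write back (flush) any dirty cache lines covering the mbuf data
 * before telling the chip to DMA the frame out. */
bus_dmamap_sync(sc->sc_dmat, txd->txd_dmamap, 0, m->m_pkthdr.len,
    BUS_DMASYNC_PREWRITE);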

> >The problem is that we couldn't find an alternative to allocating
> >mbufs and calling bus_dmamap_load in the RX interrupt!
> >
> >Will using a task to do the mbuf handling help?
>
> Nope, not at this time.  And in general, probably not for any
> single-CPU system: you're doing the same work, plus adding some
> context-switch overhead.
>
> [... reordered...]
>

Thanks

> >Is there a way to allocate a constant physical memory block for the RX
> >DMA, and then use this block for the mbufs that will be delivered
> >to the stack? In this case I must know when the TCP stack has finished
> >handling the mbuf, and then I will re-use the same physical memory
> >space!
>
> Not really, not in any MI way in NetBSD. bus_dma(9) does include a
> "BUS_DMA_COHERENT" mapping, but it's documented as being a "hint" to
> (machine-dependent) implementations of bus_dma(9); portable NetBSD
> drivers still have to issue appropriate bus_dmamap_sync() calls.
>

Thanks
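
Just so I'm reading the hint correctly: even if I ask for a coherent
mapping for, say, the descriptor ring, the sync calls stay in, along
these lines (a sketch only, placeholder names, error unwinding omitted):

bus_dma_segment_t seg;
int nsegs, error;
void *kva;

/* One-time allocation of the descriptor ring, asking for coherency. */
error = bus_dmamem_alloc(sc->sc_dmat, ring_size, PAGE_SIZE, 0,
    &seg, 1, &nsegs, BUS_DMA_NOWAIT);
if (error == 0)
    error = bus_dmamem_map(sc->sc_dmat, &seg, nsegs, ring_size,
        &kva, BUS_DMA_NOWAIT | BUS_DMA_COHERENT);

/* ... create and load sc->sc_desc_dmamap for the ring as usual ... */

/* Still bracket descriptor accesses with syncs, since COHERENT is
 * only a hint and may be ignored by the MD implementation. */
bus_dmamap_sync(sc->sc_dmat, sc->sc_desc_dmamap, 0, ring_size,
    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);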


> >Is there a way to tell the TCP stack to give me back the mbuf that was
> >delivered to it, so that I can re-use the same mbufs without calling
> >bus_dmamap_load?
>
> Not for mbufs, not really.
>
> For mbuf *clusters* you could implement a driver-private mbuf cluster
> pool, backed by normal DMA mechanisms.  You _could_ then attempt some
> machine-dependent violations of the machine-independent API, based on
> your own knowledge of your CPU and private memory pool; but such a
> driver wouldn't work on other ports of NetBSD to other CPU
> architectures (e.g., those which have IOMMUs and therefore rely on
> drivers following the documented bus_dma(9) API for correct
> operation).
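
Just to check that I've understood the private-cluster idea, would it be
roughly the following?  (A sketch only: all the gige_* names are made
up, locking/spl handling is omitted, and I still need to check the exact
MEXTADD/ext_free signature against mbuf(9) in our tree.)

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/queue.h>
#include <machine/bus.h>

/* One MCLBYTES-sized piece of a large region allocated once with
 * bus_dmamem_alloc(); its bus address is computed at attach time and
 * never changes, so no bus_dmamap_load() per packet. */
struct gige_rxbuf {
    void                    *rb_kva;
    bus_addr_t               rb_busaddr;
    SLIST_ENTRY(gige_rxbuf)  rb_link;
};

/* ext_free callback: the stack is done with the data, so put the
 * buffer back on our free list (plus whatever else mbuf(9) requires
 * of an ext_free routine). */
static void
gige_rxbuf_free(struct mbuf *m, void *buf, size_t size, void *arg)
{
    struct gige_softc *sc = arg;
    struct gige_rxbuf *rb = gige_rxbuf_from_kva(sc, buf); /* made up */

    SLIST_INSERT_HEAD(&sc->sc_rxbuf_free, rb, rb_link);
}

/* Refill: take a buffer off the free list and attach it to a fresh
 * header mbuf as external storage. */
static int
gige_rx_refill_private(struct gige_softc *sc, struct gige_rxdesc *rxd)
{
    struct gige_rxbuf *rb;
    struct mbuf *m;

    rb = SLIST_FIRST(&sc->sc_rxbuf_free);
    if (rb == NULL)
        return ENOBUFS;
    SLIST_REMOVE_HEAD(&sc->sc_rxbuf_free, rb_link);

    MGETHDR(m, M_DONTWAIT, MT_DATA);
    if (m == NULL) {
        SLIST_INSERT_HEAD(&sc->sc_rxbuf_free, rb, rb_link);
        return ENOBUFS;
    }
    MEXTADD(m, rb->rb_kva, MCLBYTES, M_DEVBUF, gige_rxbuf_free, sc);
    m->m_len = m->m_pkthdr.len = MCLBYTES;

    rxd->rxd_mbuf = m;
    /* rb->rb_busaddr goes straight into the HW descriptor; only the
     * cache invalidate (bus_dmamap_sync PREREAD over the region) is
     * still needed before the chip writes into it. */
    return 0;
}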
>
>
> If possible, a better approach might be to extend the bus_dma(9)
> implementation and the mbuf-cluster code to cache and reuse more
> information, to avoid (for example) repeated KVA-to-physical mappings
> when you reuse the same physical addresses. That's likely to be a big
> undertaking, and I'd suggest some close discussion with Jason Thorpe
> before going down that route.
>
> But my guess is, you really need to find, and discuss options with,
> someone who understands both the bus_dma(9) backend for your CPU
> (ARM?)  and your non-PCI "giga" device.
>