tech-net archive
Re: Poor TCP performance as latency increases
mlelstv%serpens.de@localhost (Michael van Elst) writes:
>lloyd%must-have-coffee.gen.nz@localhost (Lloyd Parkes) writes:
>>If you are feeling enthusiastic you could create a test network with a
>>FreeBSD router and use FreeBSD's dummynet delay emulator. This might
>>eliminate a bunch of variables.
>I'm using Linux as a delay emulator, and you can see the same behaviour.
>After some error the connection falls into slow start and quickly
>enters congestion avoidance where the window grows very slowly.
>I also doubt that the SACK code is working correctly.
I tried to analyze this further, and while our TCP stack definitely
does something wrong, I also found that one of the machines in my
test setup behaved strangely.
Both test machines were using wm(4) interfaces, but only one showed
massive re-ordering of packets in tcpdump. I think that's a flaw
in the driver.
The driver has a function wm_select_txqueue:
static inline int
wm_select_txqueue(struct ifnet *ifp, struct mbuf *m)
{
struct wm_softc *sc = ifp->if_softc;
u_int cpuid = cpu_index(curcpu());
/*
* Currently, simple distribute strategy.
* TODO:
* distribute by flowid(RSS has value).
*/
return ((cpuid + ncpu - sc->sc_affinity_offset) % ncpu) % sc->sc_nqueues;
}
which distributes outgoing packets over multiple hardware queues
based on the current CPU. Since the queues run independently, bursts
of several hundred kilobytes can leave the interface out of order.
That is more reordering than TCP handles gracefully.
The "TODO" comment shows how to handle this correctly: by using
the flow id to select the queue, you make sure that packets of a
single TCP stream are not reordered. Packets can still be reordered
by the network itself, but that usually happens to a much lesser
degree.
Another flaw I see is that the driver (on the same machine) drops
outgoing packets. On a long-distance connection, this can cause
TCP to fall back into slow start and to enter 'congestion
avoidance' mode early.
The drops are visible in the txqXXpcqdrop event counters and are
caused by an intermediate queue overflowing. This intermediate
queue has to absorb bursts of packets coming from the TCP stack,
and these bursts can be large on long-distance connections with a
huge TCP window.
Unlike a regular interface queue (which isn't even used here), this
intermediate queue cannot be configured via net.interfaces.XXX.sndq.len;
its size is hardcoded in the driver.
The queue selection and the intermediate queue are only used on wm
hardware that uses MSI-X interrupts, and that is exactly the difference
between the two test systems:
wm0 at pci0 dev 25 function 0, 64-bit DMA: PCH2 LAN (82579LM) Controller (rev. 0x04)
wm0: interrupting at msi0 vec 0
vs.
wm1 at pci6 dev 0 function 0, 64-bit DMA: I211 Ethernet (COPPER) (rev. 0x03)
wm1: for TX and RX interrupting at msix4 vec 0 affinity to 1
wm1: for TX and RX interrupting at msix4 vec 1 affinity to 2
wm1: for LINK interrupting at msix4 vec 2
Currently, I'm using the following patch to avoid both flaws:
Index: sys/dev/pci/if_wm.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_wm.c,v
retrieving revision 1.801
diff -p -u -r1.801 if_wm.c
--- sys/dev/pci/if_wm.c 10 Nov 2024 11:46:24 -0000 1.801
+++ sys/dev/pci/if_wm.c 22 Feb 2025 06:30:26 -0000
@@ -208,7 +208,7 @@ static int wm_watchdog_timeout = WM_WATC
* m_defrag() is called to reduce it.
*/
#define WM_NTXSEGS 64
-#define WM_IFQUEUELEN 256
+#define WM_IFQUEUELEN 4096
#define WM_TXQUEUELEN_MAX 64
#define WM_TXQUEUELEN_MAX_82547 16
#define WM_TXQUEUELEN(txq) ((txq)->txq_num)
@@ -224,7 +224,7 @@ static int wm_watchdog_timeout = WM_WATC
#define WM_MAXTXDMA (2 * round_page(IP_MAXPACKET)) /* for TSO */
-#define WM_TXINTERQSIZE 256
+#define WM_TXINTERQSIZE 4096
#ifndef WM_TX_PROCESS_LIMIT_DEFAULT
#define WM_TX_PROCESS_LIMIT_DEFAULT 100U
@@ -8842,6 +8842,7 @@ static inline int
wm_select_txqueue(struct ifnet *ifp, struct mbuf *m)
{
struct wm_softc *sc = ifp->if_softc;
+#if 0
u_int cpuid = cpu_index(curcpu());
/*
@@ -8850,6 +8851,9 @@ wm_select_txqueue(struct ifnet *ifp, str
* distribute by flowid(RSS has value).
*/
return ((cpuid + ncpu - sc->sc_affinity_offset) % ncpu) % sc->sc_nqueues;
+#else
+ return (sc->sc_affinity_offset + if_get_index(ifp)) % sc->sc_nqueues;
+#endif
}
static inline bool
Bumping the standard WM_IFQUEUELEN might not be required, but I wanted to
be on the safe side, also for the older hardware.