Subject: Re: NetBSD in BSD Router / Firewall Testing
To: Mike Tancsa <mike@sentex.net>
From: None <jonathan@dsg.stanford.edu>
List: tech-net
Date: 12/01/2006 14:31:31
In message <200612012036.kB1KaJJK053936@lava.sentex.ca>Mike Tancsa writes
>At 01:49 PM 12/1/2006, Jonathan Stone wrote:
>
>>As sometime principal maintainer of NetBSD's bge(4) driver, and the
>>author of many of the changes and chip-variant support subsequently
>>folded into OpenBSD's bge(4) by brad@openbsd.org, I'd like to speak
>>to a couple of points here.
>
>First off, thanks for the extended insights! This has been a most
>interesting exercise for me.
You're most welcome. (And thank you in turn for giving me a periodic
reminder that I really should write some text about interrupt
mitigation for NetBSD's bge(4) manpage.)
[Jonathan comments that we're 2 or 3 orders of magnitude away
from where switch VLAN insertion should matter.]
>Unfortunately, my budget is not so high that I can afford to have a
>high end gigE switch in my test area. I started off with a linksys,
>which I managed to hang under moderately high loads. I had an
>opportunity to test the Netgear and it was a pretty reasonable price
>(~$650 USD) for what it claims its capable of (17Mpps).
Hmm, so 17Mpps versus some 0.45 Mpps is a factor of about 38; let's call
it one and a half orders of magnitude :-/.
> Similarly, trunking,
>although a bit wonky to configure (I am far more used to Cisco land)
>at least works and doesnt seem to degrade overall performance.
"Trunking" is overloaded: it can be used to mean either link aggregation
or VLAN tagging. I have found "trunking" causes enough
misunderstandings that I avoid using the term. I assume here you mean
insertion of VLAN tags, as e.g., commonly used for switch-to-switch
links?
>>Second point: NetBSD's bge(4) driver includes support for runtime
>>manual tuning of interrupt mitigation. I chose the tuning values
>>based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.
[....]
>hw.bge.rx_lvl = 0
Yes. I can never remember if it's a global or per-device-instance.
(My original code was global, others have asked for per-instance).
Snipping the following...
>#
>
>With ipf enabled and 10 poorly written rules.
>
>rx_lvl pps
>
>0 219,181
>1 229,334
>2 280,508
>3 328,896
>4 333,585
>5 346,974
I believe the following were before-and-after stats for a 10-second
run:
>ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 180561075
>ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 183066795
Hmm. That indicates ipintrq dropped 2505720 packets during your
10-second run. Call it 250k packet drops/sec. Can you repeat your test
after increasing ipintrq via (as root)
sysctl -w net.inet.ip.ifq.maxlen=1024
Or even increase to 2048? As I mentioned earlier, even for TCP traffic
(bidirectional ttcp streams have 1 ACK every 2 packets, or a 2:1 ratio
of full-size frames to minimum-size frames), I need to configure about
512 ipintrq entries per interface. The default value of 256 isn't
really appropriate for multiple GbE interfaces using interrupt
moderation; but it is at least better than the former [ex-CSRG]
default of 50 which dated back to 10Mbit Ethernet. (Or even 3Mbit?)
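For anyone repeating the measurement, the drop-rate arithmetic above can be reproduced with a short sketch. The function name is made up for illustration, and the 10-second interval is my assumption about the run length; the two counter values are the "packets dropped" figures quoted above.

```python
def drop_rate(before: int, after: int, seconds: float) -> float:
    """Average drops per second between two snapshots of a drop counter."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (after - before) / seconds


# Counter snapshots taken from the quoted ipintrq statistics.
before_drops = 180561075
after_drops = 183066795

dropped = after_drops - before_drops
rate = drop_rate(before_drops, after_drops, 10.0)  # assumes a 10-second run
print(f"{dropped} drops, about {rate:.0f} drops/sec")
```

Taking the same two snapshots after raising the queue limit gives a like-for-like comparison, provided the run length is kept the same.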
>>I therefore see very, very good grounds to expect that NetBSD would
>>show much better performance if you increase bge interrupt mitigation.
>
>Yup, it certainly seems so!
I would hope NetBSD can do even better again, after attention to
runtime tunables; but see below.
>There are certainly tradeoffs. I guess for me in a firewall capacity,
>I want to be able to get into the box OOB when its under
>attack. 1Mpps is still considered a medium to heavy attack right
>now, but with more and more botnets out there, its only going to get
>more common place :( I guess I would like the best of both worlds, a
>way to give priority for OOB access, be that serial console or other
>interface... But I dont see a way of doing that right now via Interrupt method.
Oh, it's doable, given patience; I've done it. The first step is to
mitigate hardware interrupts to a level where the CPU can keep up with
hardware interrupt servicing of a minimal-length traffic stream, with
CPU to spare. The second step is to tweak (or fine-tune) ipintrq max
depth to where ipintrq overflows *just* enough that processing the
non-overflowed packets (done at spl[soft]net) doesn't leave you
livelocked. On the other hand, any fastpath forwarding that bypasses
ipintrq makes that approach impossible :).
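To make the second step concrete, here is a toy model of a fixed-depth input queue that tail-drops on overflow. It is only a sketch of the general idea, not the kernel's ipintrq code: the class, the function, and the burst and budget numbers are all hypothetical, chosen to show how the depth limit trades dropped packets against queued backlog.

```python
from collections import deque


class BoundedQueue:
    """Toy fixed-depth queue that drops new arrivals when full (tail drop)."""

    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.items = deque()
        self.drops = 0

    def enqueue(self, pkt):
        # When the queue is at its limit, count a drop instead of queueing.
        if len(self.items) >= self.maxlen:
            self.drops += 1
            return False
        self.items.append(pkt)
        return True

    def drain(self, budget):
        """Process up to `budget` packets; return how many were processed."""
        n = min(budget, len(self.items))
        for _ in range(n):
            self.items.popleft()
        return n


def simulate(maxlen, burst, budget, rounds):
    """Each round: `burst` arrivals, then at most `budget` packets processed.

    Returns (processed, dropped, backlog left in the queue).
    """
    q = BoundedQueue(maxlen)
    processed = 0
    for _ in range(rounds):
        for i in range(burst):
            q.enqueue(i)
        processed += q.drain(budget)
    return processed, q.drops, len(q.items)


for depth in (256, 1024):
    print(depth, simulate(depth, burst=400, budget=300, rounds=10))
```

In this model a shallow queue sheds the excess immediately and never builds a backlog, while a deeper queue drops less but carries more queued work from round to round. The tuning described above amounts to picking the depth at which that carried-over work still leaves CPU for everything else.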
>>Even so, I'd be glad to work on improving bge(4) tuning for workloads
>>dominated by tinygrams. The same packet rate as ttcp (over
>>400kpacket/sec on a 2.4Ghz Opteron) seems like an achievable target
>>--- unless there's a whole lot of CPU processing going on inside
>>IP-forwarding that I'm wholly unaware of.
>
>The AMD I am testing on is just a 3800 X2 so ~ 2.0Ghz.
Hmm. I can probably attempt to set up two bcm5721s in a similar box;
I'd have to look into load-generation.
>>At a receive rate of 123Mbyte/sec per bge interface, I see roughly
>>5,000 interrupts per bge per second. What interrupt rates are you
>>seeing for each bge device in your tests?
>
[...]
>
>That was with hw.bge.rx_lvl=5
Sorry, I didn't keep your dmesg. Which interrupts were the bge devices?
>Its hard to reproduce, but if I use 2 generators to blast in one
>direction, it seems to trigger it even with the value at 5
>
>Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
If I'm reading -current correctly, the message indicates that the
hardware Tx queue filled up, and therefore an outbound packet was put
onto the software queue and IFF_OACTIVE was set, in the hope that the
packet will be picked up later when the Tx queue has space available.
But for that to work, bge_start() should return whenever it's called with
IFF_OACTIVE set. bge_start() lacks that check. bge_intr() has a check before
it calls bge_start(), but the other calls to bge_start() (bge_tick())
don't do that. (Some calls check for ifq_snd non-NULL, but that may be
a hangover from Christos' initial import of Bill Paul's original code.)
Let's talk about that offline. If nothing else, you could try ifdef'ing
out the printf().