Subject: kern/26562: mbuf leakage - maybe ALTQ related
To: None <gnats-bugs@gnats.NetBSD.org>
From: Thilo Manske <thilo@HEH.Uni-Oldenburg.DE>
List: netbsd-bugs
Date: 08/05/2004 20:27:46
>Number: 26562
>Category: kern
>Synopsis: mbuf leakage
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Aug 05 18:42:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator: Thilo Manske
>Release: NetBSD 2.0G
>Organization:
>Environment:
System: NetBSD Server 2.0G NetBSD 2.0G (HEHOL) #24: Sat Jul 31 01:05:08 MEST 2004 thilo@Server:/sys/arch/i386/compile/HEHOL i386
Architecture: i386
Machine: i386
Network related options:
options NMBCLUSTERS=4096
options SOSEND_NO_LOAN
options GATEWAY # packet forwarding
options INET # IP + ICMP + TCP + UDP
options ALTQ # Manipulate network interfaces' output queues
#options ALTQ_BLUE # Stochastic Fair Blue
options ALTQ_CBQ # Class-Based Queueing
#options ALTQ_CDNR # Diffserv Traffic Conditioner
#options ALTQ_FIFOQ # First-In First-Out Queue
#options ALTQ_FLOWVALVE # RED/flow-valve (red-penalty-box)
options ALTQ_HFSC # Hierarchical Fair Service Curve
#options ALTQ_LOCALQ # Local queueing discipline
options ALTQ_PRIQ # Priority Queueing
options ALTQ_RED # Random Early Detection
#options ALTQ_RIO # RED with IN/OUT
#options ALTQ_WFQ # Weighted Fair Queueing
SOSEND_NO_LOAN was added to help the situation (see below) but it didn't
make any difference.
>Description:
This system is used as a router, web proxy, webserver, mail server, DNS
server and some other stuff for about 150 students (i.e. a lot of network
traffic is generated by and routed through this box). On both of its
physical network interfaces (3c905 and 3c900) ALTQ is used (CBQ+PRIQ+RED, no
HFSC at the moment) to shape and prioritize the network traffic.
It hasn't been possible to keep this system up for more than three weeks in
this role. With earlier versions of NetBSD (1.5+ALTQ patch) the symptoms were
so strange (crashes, freezes, or looping kernel messages from the SCSI
interface driver on the console) that I blamed rotting hardware for them.
After ALTQ got integrated into -current and I upgraded to 1.6ZI and later
2.0G, the system no longer died that way; only network connections did, and a
reboot was the only way I found to get out of that situation.
This made me gather some statistics:
The graph below shows the number of mbufs allocated to data and packet headers
plotted over the system's uptime. (The data was collected by mrtg running a
script every 5 minutes which parses the output of netstat -m.)
The average number of mbufs allocated to data rises by about 100 per day.
[Graph "mbuf allocations": mbufs (y axis, 0 to 1400) plotted against days
uptime (x axis, 0 to 6), with two series: data (#) and headers (*).]
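The collection step described above (a script run by mrtg that parses
netstat -m) could look roughly like the following. This is only a sketch, not
the script actually used on this box: the function names are made up for
illustration, and the regular expression assumes that netstat -m prints lines
of the form "<N> mbufs allocated to data" and "<N> mbufs allocated to packet
headers", which should be checked against the local output.

```python
import re
import subprocess

# Assumed line shapes in `netstat -m` output (verify on the target system):
#     "812 mbufs allocated to data"
#     "164 mbufs allocated to packet headers"
_MBUF_LINE = re.compile(
    r"^[ \t]*(\d+)[ \t]+mbufs allocated to (data|packet headers)[ \t]*$",
    re.MULTILINE,
)


def parse_mbuf_counts(text):
    """Return (data, headers) mbuf counts found in netstat -m style text.

    A counter that does not appear in the text is reported as 0.
    """
    counts = {"data": 0, "packet headers": 0}
    for number, kind in _MBUF_LINE.findall(text):
        counts[kind] = int(number)
    return counts["data"], counts["packet headers"]


def format_for_mrtg(text):
    """Render the two counters one per line, data first, then headers.

    mrtg external scripts normally also print an uptime string and a
    target name on two further lines; those are left out of this sketch.
    """
    data, headers = parse_mbuf_counts(text)
    return "{}\n{}".format(data, headers)


def main():
    # Hypothetical entry point: run netstat -m and print the two counters.
    result = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    )
    print(format_for_mrtg(result.stdout))
```

Calling main() from the script that mrtg runs every 5 minutes would give one
sample of both counters per run; parse_mbuf_counts() can also be fed saved
netstat -m output to compare samples taken at different uptimes.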
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: