netbsd-bugs: kern/26562: mbuf leakage

Subject: kern/26562: mbuf leakage - maybe ALTQ related
To: None <gnats-bugs@gnats.NetBSD.org>
From: Thilo Manske <thilo@HEH.Uni-Oldenburg.DE>
List: netbsd-bugs
Date: 08/05/2004 20:27:46
>Number:         26562
>Category:       kern
>Synopsis:       mbuf leakage
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Aug 05 18:42:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Thilo Manske
>Release:        NetBSD 2.0G
>Organization:
>Environment:
System: NetBSD Server 2.0G NetBSD 2.0G (HEHOL) #24: Sat Jul 31 01:05:08 MEST 2004 thilo@Server:/sys/arch/i386/compile/HEHOL i386
Architecture: i386
Machine: i386

Network related options:
options         NMBCLUSTERS=4096
options         SOSEND_NO_LOAN
options         GATEWAY         # packet forwarding
options         INET            # IP + ICMP + TCP + UDP
options         ALTQ            # Manipulate network interfaces' output queues
#options        ALTQ_BLUE       # Stochastic Fair Blue
options         ALTQ_CBQ        # Class-Based Queueing
#options        ALTQ_CDNR       # Diffserv Traffic Conditioner
#options        ALTQ_FIFOQ      # First-In First-Out Queue
#options        ALTQ_FLOWVALVE  # RED/flow-valve (red-penalty-box)
options         ALTQ_HFSC       # Hierarchical Fair Service Curve
#options        ALTQ_LOCALQ     # Local queueing discipline
options         ALTQ_PRIQ       # Priority Queueing
options         ALTQ_RED        # Random Early Detection
#options        ALTQ_RIO        # RED with IN/OUT
#options        ALTQ_WFQ        # Weighted Fair Queueing

SOSEND_NO_LOAN was added to help the situation (see below) but it didn't
make any difference.

>Description:
This system is used as a router, web proxy, web server, mail server, DNS
server and for a few other services for about 150 students (i.e. a lot of
network traffic is generated by and routed through this box). ALTQ is used on
both of its physical network interfaces (3c905 and 3c900) to shape and
prioritize the network traffic (CBQ+PRIQ+RED, no HFSC at the moment).

It has not been possible to keep this system up for more than three weeks
under this load. With earlier versions of NetBSD (1.5 + ALTQ patch) the
symptoms were so strange (crashes, freezes, or looping kernel messages from
the SCSI interface driver on the console) that I blamed failing hardware.
After ALTQ was integrated into -current and I upgraded to 1.6ZI and later
2.0G, the system no longer died that way; only the network connections did,
and a reboot was the only way I found to recover from that situation.

This made me gather some statistics:

The graph below shows the number of mbufs allocated to data and packet headers
plotted over the system's uptime (the data was collected by mrtg running a
script every 5 minutes which parses the output of netstat -m).
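
For anyone who wants to collect the same numbers, here is a minimal sketch of
such a script. It is not the script used for the graph. It assumes that
netstat -m prints one line per mbuf type of the form "<count> mbufs allocated
to <type>", and that mrtg reads four lines from an external script (first
value, second value, uptime string, target name). The function names and the
"mbufs" target name are made up for illustration.

```python
import re
import subprocess

# Assumed line format: "<count> mbuf(s) allocated to <type>".
_ALLOC_LINE = re.compile(r"^\s*(\d+)\s+mbufs?\s+allocated to\s+(.+?)\s*$")


def parse_mbuf_counts(text):
    """Return {type name: count} for every 'allocated to' line in text."""
    counts = {}
    for line in text.splitlines():
        match = _ALLOC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def mrtg_lines(counts):
    """Build the four lines an mrtg external script is expected to print."""
    return [
        str(counts.get("data", 0)),            # first variable
        str(counts.get("packet headers", 0)),  # second variable
        "",                                    # uptime string (left empty here)
        "mbufs",                               # hypothetical target name
    ]


def read_netstat():
    """Run 'netstat -m' and return its standard output as text."""
    result = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Printing "\n".join(mrtg_lines(parse_mbuf_counts(read_netstat()))) from a
script that mrtg runs every 5 minutes gives one sample per run.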

As you can see, the average number of mbufs allocated to data rises by about
100 per day.

                                 mbuf allocations
mbufs
       +-----------+----------+-----------+----------+-----------+----------+
       +           +          +           +          +          data ###### +
  1400 ++                                                    headers ******++
       |                                                                    |
       |                                                                    |
       |                                                             #      |
       |                                                             #      |
  1200 ++                                                            ##    ++
       |                                                             ##     |
       |                                          #                  ##     |
       |                                          #                  ##     |
  1000 ++                                         #                  ##    ++
       |                                          #          #       ## ##  |
       |                                # #       # #        ####    #####  |
       |                                ###       # #   # ########   #####  |
   800 ++                               ###   #   # ################ ###   ++
       |                                ###   # # # ############ ##### #    |
       |                               ####   ######## ####   #  #  ##      |
       |                               # ### ########  ###                  |
       |                #   #          # # ##### ##      #                  |
   600 ++               #  ##       ####     #                             ++
       |                #  #############                                    |
       |     ##  #  ###########     #             *     *                   |
       #    ### ## ### #                          *     *                   |
   400 #+   # ## # #                             **     *                  ++
       ##   #     ##                             **     *             *     |
       ######                                    **     * *         * *     |
       |  ##                            *        **     * * *       * *  *  |
       |                             ****     *  **  *  *******     * ** *  |
   200 ++                   *       *****     ** *****  ********    ****** ++
       |   **** ** *     *  **    ****** *    ********  *********   ******  |
       |   ******* *  ** *** **   ***    **  *********  *********  *******  |
       * *** *** ******** *   *****       *  *  *   ****** ** **** ** ****  +
     0 ****--------+***-------+***--------****-------****--------****------++
       0           1          2           3          4           5          6
                                    days uptime
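
The growth rate quoted above was read off the graph by eye. A rough way to
put a number on it is a least-squares line through the collected samples. The
sketch below assumes the samples are available as (uptime in days, mbuf count)
pairs; the function name is made up for illustration and no values from the
graph are used.

```python
def daily_growth(samples):
    """Least-squares slope of mbuf count over uptime, in mbufs per day.

    samples: sequence of (uptime_days, mbuf_count) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    # Spread of the sample times around their mean.
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same uptime")
    # Covariance term between time and count.
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    return sxy / sxx
```

A slope that stays clearly positive over several days, while the traffic
pattern repeats daily, points to a leak rather than to normal load variation.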
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: