netbsd-help: and spl issue? occasional hang-ups/panics with my driver

Subject: and spl issue? occasional hang-ups/panics with my driver
To: None <netbsd-help@netbsd.org>
From: elmar <elmar@engel-kg.com>
List: netbsd-help
Date: 06/27/2005 15:38:24
hi list,

i have written a driver for a virtual network device that sits on top of
a real network device, much as the vlan driver does.  since i didn't
have much information on how to write such a driver, i studied existing
drivers like vlan, gif, ppp, tun, gre, tap.

the code is almost stable:  it has been running on two machines with high
load for more than a day or two.  to find out where to look for the
offending code in my driver i would appreciate getting some hints on

 - common pitfalls when writing a network driver
 - the usage of the spl* functions and/or locks (examples).
 - documentation on the software priority level the (network) driver
   entry points are called at
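
for reference, the usual shape of an spl-protected critical section can
be sketched as a user-space simulation.  sim_splnet(), sim_splx(),
cur_ipl, the IPL_* values and struct my_softc are made up for
illustration; in a real kernel the calls would be splnet(9) and splx(9),
and the level actually needed depends on who else touches the data.

```c
#include <assert.h>

/* Hypothetical priority levels, lowest to highest; the real ones are
 * machine dependent. */
enum { IPL_NONE = 0, IPL_SOFTNET = 1, IPL_NET = 2, IPL_HIGH = 3 };

static int cur_ipl = IPL_NONE;	/* simulated current priority level */

/* Raise to at least `level` and return the previous level.  This never
 * lowers the level; lowering only happens through sim_splx(). */
int
sim_splraise(int level)
{
	int old = cur_ipl;

	if (level > cur_ipl)
		cur_ipl = level;
	return old;
}

/* Simulated splnet(): block network interrupts. */
int
sim_splnet(void)
{
	return sim_splraise(IPL_NET);
}

/* Simulated splx(): restore a level saved by sim_splnet(). */
void
sim_splx(int s)
{
	cur_ipl = s;
}

/* Hypothetical driver state shared between the output path and the
 * interrupt path. */
struct my_softc {
	int sc_qlen;	/* packets queued toward the parent interface */
};

/* Enqueue one packet; the queue update is the critical section. */
void
my_enqueue(struct my_softc *sc)
{
	int s;

	s = sim_splnet();		/* enter critical section */
	assert(cur_ipl >= IPL_NET);	/* we must be protected here */
	sc->sc_qlen++;
	sim_splx(s);			/* restore the saved level */
}
```

the point of saving and restoring `s` is that the caller may already be
running at a higher level; restoring a hard-coded level instead would
silently drop that protection.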

i have attached some more info on panics and hang-ups i have seen.  maybe
this will help point me in the right direction.

thank you for your attention.

regards
	elmar

 - - - - - - - - -


this issue appears with 3.99.6, 3.99.3, and 2.0.2.


i'm running different netbsd machines

 1) fast:
    AMD Athlon XP 3000+ (686-class), 2091.74 MHz, id 0x6a0
 2) medium:
    AMD Athlon Model 4 (Thunderbird) (686-class), 1202.21 MHz, id 0x642
 3) slow:
    VIA C3 Samuel 2/Ezra (686-class), 601.40 MHz, id 0x673

... in bridging mode with one of the interfaces being a virtual interface
(created much the same as if_vlan.c does).

the code between the virtual interface and its parent interface has lots
of work to do which produces high load in the interrupt context.

under low load, my systems appeared to run fine.  under high load,
however, the medium machine (2) would hang or panic after widely varying
times, for example:

    5s
    30s
    20min
    90min
    > 24h
    ...

i have only seen machine 2 panic or hang.  the slow one (3) did not,
although top(1) showed it spending about 96% of its time in interrupt
context and it was clearly working hard (sluggish console).


swapping the interfaces of machine 2 (virtual interface [cf. if_vlan.c]
on top of fxp0 now instead of ex0) made the panic or hang appear much
earlier (5s/30s instead of 90min).


DDB's callout command showed that a function was scheduled as a callout
although nothing in the software ever uses it.  another function was
scheduled several times although it should only ever be called via
sysctl(8).  the last lines of the backtrace read like this:

    usb_all_tasks at 0xcac9e9f0
    myFctNotToBeCalledHere+0x3a
    mySendFctOkHere+0x37
    myTimerArray(varying,args) at myFctOnlyToBeCalledViaSysctl+0xb529b
    myFctThatIsNeverUsed+0xb3
    softclock+0x262
    Xsoftclock+0x26
    --- interrupt ---
    cpu_switch+0x9f
    ltsleep+0x327
    uvm_scheduler+0x74
    main(0,0,0,0,0) at main+0x66e


so i removed everything that uses timer functions and tried again.  my
hypotheses were:

 a) some code overwrote my timer array or called my code to schedule
    timer functions, so the above mentioned functions were scheduled
    (which they should never be)

 b) the network thread and the thread running the callout function
    execute the same critical code concurrently, which points to some
    kind of locking and/or splxxx problem.

 c) kernel stack overflow?  neither KSTACK_CHECK_MAGIC nor
    KSTACK_CHECK_DR0 was defined.
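
the idea behind a magic-pattern stack check, which is what an option
like KSTACK_CHECK_MAGIC appears to be based on, can be shown in user
space: fill the stack area with a known byte, let the code run, then
count how much of the pattern survived.  this is only a sketch under
that assumption; STACK_MAGIC, stack_fill() and stack_unused() are
hypothetical names, and the buffer stands in for a real kernel stack
that grows downward.

```c
#include <stddef.h>
#include <string.h>

#define STACK_MAGIC 0xA5	/* arbitrary fill byte */

/* Paint the whole stack area with the magic byte before use. */
void
stack_fill(unsigned char *base, size_t size)
{
	memset(base, STACK_MAGIC, size);
}

/*
 * Count the bytes at the low end that still hold the magic byte.
 * With a downward-growing stack, usage starts at base + size and moves
 * toward base, so the intact run at the bottom is the unused part.
 * A result of 0 means the stack was used up to (or past) its limit.
 */
size_t
stack_unused(const unsigned char *base, size_t size)
{
	size_t i;

	for (i = 0; i < size && base[i] == STACK_MAGIC; i++)
		continue;
	return i;
}
```

a value that shrinks toward zero under load would support the overflow
theory; a comfortably large value would point back at a locking problem
or a stray write instead.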


now, without all the timer code, the slow machine (3) also panics after
some hours.  the panic doesn't give me much information, just that DDB
lost its frame, like this:

	DDB lost frame for netbsd:Xsoftnet+0xXX, trying 0xZZZZZZZZ

	(i don't have the exact numbers at hand right now)


could this be a kernel stack overflow?  googling didn't give me much
info.  one hint was to disable TCP SACK, which i think won't help me very
much since the machine is bridging (and not interested in tcp).  should i
downgrade if.c?

i reduced the kernel size by removing mca, audio, and scanner support,
among others.  my impression is that the panic or hang tends to appear
earlier the more features i take out of the kernel config file.

help is welcome.

regards, again
		elmar