Subject: an spl issue? occasional hang-ups/panics with my driver
To: None <netbsd-help@netbsd.org>
From: elmar <elmar@engel-kg.com>
List: netbsd-help
Date: 06/27/2005 15:38:24
hi list,
i have written a driver for a virtual network device that sits on top of
a real network device, much like the vlan driver does. since i didn't
have much information on how to write such a driver, i looked at several
existing drivers: vlan, gif, ppp, tun, gre, tap.
the code is almost stable: it has been running on two machines with high
load for more than a day or two. to find out where to look for the
offending code in my driver i would appreciate getting some hints on
- common pitfalls when writing a network driver
- the usage of the spl* functions and/or locks (examples).
- documentation on the software priority level the (network) driver
entry points are called at
i have attached some more info on panics and hang-ups i have seen. maybe
this will help point me in the right direction.
thank you for your attention.
regards
elmar
- - - - - - - - -
this issue appears with 3.99.6, 3.99.3, and 2.0.2.
i'm running different netbsd machines
1) fast:
AMD Athlon XP 3000+ (686-class), 2091.74 MHz, id 0x6a0
2) medium:
AMD Athlon Model 4 (Thunderbird) (686-class), 1202.21 MHz, id 0x642
3) slow:
VIA C3 Samuel 2/Ezra (686-class), 601.40 MHz, id 0x673
... in bridging mode with one of the interfaces being a virtual interface
(created much the same as if_vlan.c does).
the code between the virtual interface and its parent interface has lots
of work to do which produces high load in the interrupt context.
under low load, my systems ran fine at first glance. under high load,
however, the medium machine (2) would hang or panic after widely varying
times: anything from about 5s, 30s, 20min, or 90min to more than 24h.
at that point only machine 2 had panicked or hung. the slow machine (3)
had not, although top(1) showed that it was spending about 96% of its
time in interrupt context and was really hard at work (slower reaction
on the console).
swapping the interfaces of machine 2 (virtual interface [cf. if_vlan.c]
on top of fxp0 now instead of ex0) made the panic or hang appear much
earlier (5s/30s instead of 90min).
DDB's callout command showed that a function was scheduled which is not
used anywhere in the software. another function was scheduled several
times although it should only ever be called via sysctl(8). the last
lines of the backtrace read like this:
usb_all_tasks at 0xcac9e9f0
myFctNotToBeCalledHere+0x3a
mySendFctOkHere+0x37
myTimerArray(varying,args) at myFctOnlyToBeCalledViaSysctl+0xb529b
myFctThatIsNeverUsed+0xb3
softclock+0x262
Xsoftclock+0x26
--- interrupt ---
cpu_switch+0x9f
ltsleep+0x327
uvm_scheduler+0x74
main(0,0,0,0,0) at main+0x66e
so i removed all parts containing timer functions and tried again. the
idea behind this is that one of the following might be happening:
a) some code overwrote my timer array or called my code that schedules
timer functions, so the above mentioned functions were scheduled
(which they should never be).
b) the network thread and the thread running the callout function
are working on the same (critical) code, which points to some kind of
locking and/or splxxx problem.
c) kernel stack overflow? neither KSTACK_CHECK_MAGIC nor
KSTACK_CHECK_DR0 was defined.
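to make hypothesis b) concrete, here is a minimal userland sketch of the
save/raise/restore pattern that spl(9) describes (s = splnet(); ...; splx(s);).
it is not kernel code: mock_splnet(), mock_splx(), mock_timer_fires(),
struct my_softc and my_timeout() are hypothetical stand-ins invented for this
illustration, and the numeric ordering of the mock levels is an assumption.
it only shows how a callout that fires in the middle of a two-step update
sees inconsistent state unless the update is bracketed.

```c
#include <assert.h>

/* mock priority levels; the ordering is an assumption of this sketch */
enum { IPL_NONE = 0, IPL_SOFTCLOCK = 1, IPL_SOFTNET = 2, IPL_NET = 3 };

int cur_ipl = IPL_NONE;        /* mock "current priority level" */
int callout_pending = 0;       /* a timer fired while it was blocked */
int invariant_violations = 0;  /* times the callout saw torn state */

/* hypothetical driver state shared by the send path and a callout */
struct my_softc {
    int queue_len;   /* number of queued packets */
    int queue_sum;   /* invariant: always 10 * queue_len */
} sc;

/* mock callout handler: checks the invariant, then drains the queue */
void my_timeout(void)
{
    if (sc.queue_sum != 10 * sc.queue_len)
        invariant_violations++;
    sc.queue_len = 0;
    sc.queue_sum = 0;
}

/* mock timer interrupt: runs the handler only if softclock is not blocked */
void mock_timer_fires(void)
{
    if (cur_ipl >= IPL_SOFTCLOCK) {
        callout_pending = 1;   /* deferred until the level drops */
        return;
    }
    my_timeout();
}

/* stand-in for splnet(): raise the level, return the previous one */
int mock_splnet(void)
{
    int old = cur_ipl;
    if (cur_ipl < IPL_NET)
        cur_ipl = IPL_NET;
    return old;
}

/* stand-in for splx(): restore the saved level, run deferred work */
void mock_splx(int s)
{
    assert(s >= IPL_NONE && s <= IPL_NET);
    cur_ipl = s;
    if (callout_pending && cur_ipl < IPL_SOFTCLOCK) {
        callout_pending = 0;
        my_timeout();
    }
}

/* unprotected update: the timer can fire between the two stores */
void enqueue_unprotected(void)
{
    sc.queue_len++;
    mock_timer_fires();        /* simulated interrupt mid-update */
    sc.queue_sum += 10;
}

/* protected update: the same interrupt is deferred until splx() */
void enqueue_protected(void)
{
    int s = mock_splnet();
    sc.queue_len++;
    mock_timer_fires();        /* blocked, marked pending */
    sc.queue_sum += 10;
    mock_splx(s);              /* callout now sees consistent state */
}

/* helper for experiments: put the mock back into its initial state */
void reset_state(void)
{
    cur_ipl = IPL_NONE;
    callout_pending = 0;
    invariant_violations = 0;
    sc.queue_len = 0;
    sc.queue_sum = 0;
}
```

in a real driver the same bracket would have to go around every access to
state that the callout and the input/output paths share, and the level chosen
has to be at least as high as that of every context touching the data.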
now, without all that timer stuff, the slow machine (3) also runs into a
panic after some hours. the panic doesn't give me much information, just
that DDB lost its frame, like this:
DDB lost frame for netbsd:Xsoftnet+0xXX, trying 0xZZZZZZZZ
(i don't have the exact numbers at hand right now)
could this be a kernel stack overflow? googling didn't give me much
info. one hint was to disable TCP SACK, which i think won't help me very
much since the machine is bridging (and not interested in tcp). should i
downgrade if.c?
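regarding the stack overflow question: the general idea behind a
magic-pattern stack check (the kind of check an option such as
KSTACK_CHECK_MAGIC is understood to enable) can be sketched in userland as
below. this is only an illustration of the technique, not the kernel's
implementation. STACK_WORDS, STACK_MAGIC and the three function names are
made up for the example, and a downward-growing stack (as on i386) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 64
#define STACK_MAGIC 0xdeadbeefU   /* arbitrary fill pattern for this sketch */

/* fill a (mock) stack area with the magic pattern before it is used */
void stack_fill_magic(uint32_t *stack, size_t nwords)
{
    assert(stack != NULL);
    for (size_t i = 0; i < nwords; i++)
        stack[i] = STACK_MAGIC;
}

/*
 * return the deepest usage in words, assuming the stack grows downward
 * from stack[nwords - 1] toward stack[0]. scanning starts at the low end;
 * the first word that no longer holds the pattern marks the high-water
 * mark.
 */
size_t stack_words_used(const uint32_t *stack, size_t nwords)
{
    size_t i = 0;

    while (i < nwords && stack[i] == STACK_MAGIC)
        i++;
    return nwords - i;
}

/*
 * if even the lowest word has been clobbered, usage reached the end of
 * the area and has possibly run past it.
 */
int stack_overflowed(const uint32_t *stack, size_t nwords)
{
    return nwords > 0 && stack[0] != STACK_MAGIC;
}
```

a check like this only reports the problem after the fact (when the area is
scanned), so it helps to confirm or rule out an overflow rather than to catch
the offending call chain directly.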
i reduced the kernel size by removing mca, audio, scanners, among
others. my impression is that the panic or hang tends to appear earlier
the more features i take out of the kernel config file.
help is welcome.
regards, again
elmar