Subject: kill -HUP pid-of-ipmon occasionally freezes system?
To: current-users@netbsd.org
From: Arto Selonen <arto@selonen.org>
List: current-users
Date: 03/17/2004 16:08:45
Hi!
For the past year or so (!), we've experienced the following problem in
one of our NetBSD-current systems. It may happen twice a week, or it may
take a month to show up, but eventually, after sending SIGHUP to
ipmon, the whole system freezes. Unfortunately, it has not disappeared
with OS upgrades, and even though we suspect ipfilter, that has not been
upgraded for quite a while either. Repeated searches of various
mailing lists have not produced anything obvious, although we might
have missed something anyway. :( So, I'm asking whether anybody else
has experienced similar problems, and/or might have some suggestions.
Here is a brief list of what we've learned so far:
1) when the system freezes, the console looks dead:
- no response to key presses (getty is running)
- kernel debugger can be accessed!
2) 'trace' has not revealed anything obvious/useful to us
- it looks a bit different each time, i.e.
not always the same processes or the same order
3) network seems dead
- already established sessions seem to work
(there have been cases where they didn't work)
- no new connections to the system or through it
(there have been cases where certain new sessions worked)
- connectionless traffic (ipfilter-wise) gets through
4) occasionally, the problem goes away on its own
- it is known to have cleared in three hours, or
lasted over 10 hours (after which we rebooted)
5) we've tracked it down to log file rotation
- the freeze seemed to occur after trying to rotate
ipfilter/ipmon logs
- looking at src/dist/ipf/ipmon.c doesn't show anything
obviously suspicious in the signal handler handlehup()
(see the generic handler sketch after this list)
6) network symptoms point to state table size / session timeout
- we've tried increasing these (repeated as a snippet after this list):
sys/net/if.h IFQ_MAXLEN 50->500
sys/netinet/ip_state.h IPSTATE_SIZE 5737->19997
sys/netinet/ip_state.h IPSTATE_MAX 4013->14009
- the above *may* have raised the average uptime
from roughly two weeks to four weeks
- a further increase seemed to lower it to about one week
- maybe it's irrelevant?
- we've now lowered the idle TCP session timeout from 120h -> 24h
7) no obvious correlation between recorded network traffic amounts
and observed "crashes"
- based on switch port counter stats
- there have been traffic peaks that were handled
without problems
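For what it's worth, here is a minimal sketch of the usual "reopen the
log file on SIGHUP" pattern that item 5 refers to. This is NOT the
actual handlehup() from src/dist/ipf/ipmon.c; the function names, the
log path and the structure are made up for illustration. The only
point it tries to make is that a handler which merely sets a flag is
async-signal-safe, whereas doing real work (fopen/fclose, re-reading
tables) directly inside the handler is not:

    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t got_hup;

    static void
    handle_hup(int sig)
    {
            (void)sig;
            got_hup = 1;            /* async-signal-safe: just set a flag */
    }

    /* The main loop later does the work that is unsafe in a handler. */
    static FILE *
    maybe_reopen(FILE *fp, const char *path)
    {
            if (!got_hup)
                    return fp;
            got_hup = 0;
            if (fp != NULL)
                    (void)fclose(fp);
            return fopen(path, "a");
    }

    int
    main(void)
    {
            const char *logpath = "/var/log/ipflog";    /* illustrative */
            FILE *logfp = fopen(logpath, "a");

            (void)signal(SIGHUP, handle_hup);
            for (;;) {
                    logfp = maybe_reopen(logfp, logpath);
                    /* ... read log records, write them to logfp ... */
                    break;          /* keep the sketch finite */
            }
            if (logfp != NULL)
                    (void)fclose(logfp);
            return 0;
    }

Whether anything along those lines matters here is an open question, of
course; I'm including it only to show the kind of race a HUP arriving
at an unlucky moment could expose in a log-writing daemon.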
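And just to have them in one place, these are the values from item 6 as
we are currently running them (the file names and numbers are from our
tree; the surrounding comment text is mine):

    /* sys/netinet/ip_state.h */
    #define IPSTATE_SIZE    19997   /* was 5737 */
    #define IPSTATE_MAX     14009   /* was 4013 */

    /* sys/net/if.h */
    #define IFQ_MAXLEN      500     /* was 50 */

These defaults are baked in when the kernel is compiled.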
From previous debugging, we know that SIGHUP/ipmon *is* involved. When
syslogd was HUP'ed after ipmon, we would only see one "syslogd:
restart" in /var/log/messages (the one caused by rebooting the stuck
system), whereas after we moved ipmon to be HUP'ed after syslogd,
messages would also show the restart entry from around midnight, when
the logs are rotated. On the other hand, there have been cases where
processing advanced past the SIGHUP stage to the point of examining
the log files and sending the reports. In those cases, the system
would freeze before all the log files were handled (not always at the
same point, so it's not a problem with the log file filters). So,
SIGHUP to ipmon does seem to trigger the problem, but the problem may
not surface instantly. Sounds weird, doesn't it?
Since the problem occurs in a relatively short window (less than five
minutes), and there is plenty of disk space available, we could collect
all sorts of data, if only we knew what to collect, and how that might
be relevant. However, "last second" data might only be preserved if it
is printed on the console, since disk sync & reboot from the debugger
doesn't always work, either.
If the idle session timeout tweak doesn't help, then I'm out of ideas.
(I'm trusting in Murphy's law to make it work now that I've asked for
help, but now that I've mentioned it...)
open to any and all ideas on this,
Arto Selonen
--
#######======------ http://www.selonen.org/arto/ --------========########
Everstinkuja 5 B 35 Don't mind doing it.
FIN-02600 Espoo arto@selonen.org Don't mind not doing it.
Finland tel +358 50 560 4826 Don't know anything about it.