Subject: kern/28865: panic in in_cksum()
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <dokas@cs.umn.edu>
List: netbsd-bugs
Date: 01/04/2005 19:24:00
>Number: 28865
>Category: kern
>Synopsis: panic in in_cksum()
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Jan 04 19:24:00 +0000 2005
>Originator: Paul Dokas
>Release: NetBSD 2.99.11
>Organization:
University of Computer Science, Dept of Computer Science
>Environment:
System: NetBSD host.cs.umn.edu 2.99.11 NetBSD 2.99.11 (HOST) #10: Thu Dec 23 10:13:41 CST 2004 root@host.cs.umn.edu:/usr/obj/sys/arch/i386/compile/HOST i386
Architecture: i386
Machine: i386
>Description:
I've got a host acting as a syslog collector that panics in in_cksum() under load.
This started happening after I rebuilt the system on Dec 20, 2004. I suspect it's
related to the checksum-related changes committed in the first half of December.
Here's the panic information (copied by hand):
kernel: page fault trap, code=0
stopped in pid 10628.1 (logsurfer) at netbsd:in_cksum+0x9e adcl 0x1c(%eb), %eax
db> bt
in_cksum(ca52f000,0,5dc,9000001,13092600) at netbsd:in_cksum+0x9e
?(c124b534,0,0,2,1c082500) at 0
?(c124c820,0,0,0,1c1b8800) at 0
?(c12cc620,48101180,0,0,11e15600) at 0
?(c130bd20,0,0,0,3c17d00) at 0
Bad frame pointer: 0c130b500
db> show reg
ds 0x10
es 0x10
fs 0x30
gs 0x10
edi 0x14
esi 0
ebp 0xc0fed600 pnpbios_softc+0xbcf37c
ebx 0xc0ffafe4 pnpbios_softc+0xbdcd60
edx 0xbc
ecx 0xc0ff6d00 pnpbios_softc+0xbd8a7c
eax 0x17626744
eip 0xc029899a in_cksum+0x9e
cs 0x8
eflags 0x10217
esp 0xcb4bfb44 pnpbios_softc+0xb0518c0
ss 0x10
Here's a little more background on this machine:
+ it's collecting syslogs from around 1,000 other computers
+ it's got a couple of IPsec tunnels similar to these:
spdadd 128.101.this.host/32 146.57.that.host/32 any -P out ipsec esp/transport//require ah/transport//require;
spdadd 146.57.that.host/32 128.101.this.host/32 any -P in ipsec esp/transport//require ah/transport//require;
+ it's using IPFilter to implement a "no inbound except for syslog and ssh" policy
+ it's using IPNat to allow outbound RSH:
map fxp0 0.0.0.0/0 -> 128.101.this.host/32 proxy port shell rcmd/tcp
Also, these panics appear to be load related. They were happening pretty consistently
at 5:20am every morning until I moved the crontab entry that fired at that time. Now
the panics happen at fairly random times, but only when the machine's load average
goes above 2.0.
Finally, the backtrace is different every time it panics, but it always looks
corrupted (to my uneducated eye, that is), as with the "?" frames at address 0 and the
"Bad frame pointer" message above. If I had to guess, I'd say this is some sort of
stack corruption.
>How-To-Repeat:
Build a -current kernel and use the machine to collect syslogs from a few thousand hosts.
>Fix:
Sorry, I don't know.