Subject: Kernel Deadlock under stress
To: None <tech-kern@netbsd.org>
From: Huy Vu <hvu@hyperchip.com>
List: tech-kern
Date: 02/14/2003 15:20:29
This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.
------_=_NextPart_001_01C2D466.849D01B0
Content-Type: text/plain;
charset="iso-8859-1"
Here is our env:
Netbsd 1.6 (updated with all the patches since Dec 15)
CPU: MAC PowerPC
Root filesystem: NFS Root
-----
We have two process that communicate over SHARE memory and semaphore. Under
stress the kernel seems to deadlock, triggered by these two processes. At
this point the system just lock up and no response to ping or anything. I
was able to break into ddb and grab the following traces.
I verified that in deed the kernel has plenty of memory left, and if my
interpretation is correct is that the kernel is in "Idle" but in a deadlock
state.
------
Any help would be appreciated.
Thanks
Huy
-----------------
db> ps /n
PID PPID PGRP UID S FLAGS COMMAND
WAIT
...
67 49 67 0 3 0x4004 proc1
vnlock
...
65 49 65 0 3 0x4004 proc2
netio
db> trace /t 0t65
trace: pid 65 at 0xd9c37720
0xd9c37780: at cpu_switch+ac
0xd9c37790: at mi_switch+30c
0xd9c377c0: at ltsleep+488
0xd9c37800: at sbwait+78
0xd9c37820: at soreceive+5b4
0xd9c378a0: at nfs_receive+7b4
0xd9c37920: at nfs_reply+8c
0xd9c37970: at nfs_request+71c
0xd9c37a50: at nfs_getattr+7c4
0xd9c37ac0: at VOP_GETATTR+80
0xd9c37b10: at nfs_lookup+328
0xd9c37c40: at VOP_LOOKUP+74
0xd9c37c80: at lookup+624
0xd9c37ce0: at namei+268
0xd9c37d40: at vn_open+254
0xd9c37df0: at sys_open+124
0xd9c37eb0: at trap+a60
0xd9c37f50: user SC trap by 0x419237d8: srr1=0xd032
r1=0x7fffe750 cr=0x40000005 xer=0x20000000 ctr=0
db> trace /t 0t67
trace: pid 67 at 0xd9c3f910
0xd9c3f970: at cpu_switch+ac
0xd9c3f980: at mi_switch+30c
0xd9c3f9b0: at ltsleep+488
0xd9c3f9f0: at _lockmgr+12d8
0xd9c3fa40: at genfs_lock+50
0xd9c3fa60: at VOP_LOCK+68
0xd9c3faa0: at vn_lock+cc
0xd9c3fac0: at vget+224
0xd9c3fae0: at cache_lookup+4c4
0xd9c3fb10: at nfs_lookup+160
0xd9c3fc40: at VOP_LOOKUP+74
0xd9c3fc80: at lookup+624
0xd9c3fce0: at namei+268
0xd9c3fd40: at vn_open+254
0xd9c3fdf0: at sys_open+124
0xd9c3feb0: at trap+a60
0xd9c3ff50: user SC trap by 0x419737d8: srr1=0xd032
r1=0x7fffe7a0 cr=0x44000005 xer=0x20000000 ctr=0
db> show uvm
Current UVM status:
pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
247221 VM pages: 46496 active, 0 inactive, 325 wired, 194909 free
min 10% (25) anon, 10% (25) file, 5% (12) exec
max 80% (204) anon, 50% (128) file, 30% (76) exec
pages 38612 anon, 5310 file, 5882 exec
freemin=64, free-target=85, inactive-target=0, wired-max=164814
faults=88109, traps=1475879, intrs=195372, ctxswitch=111340
softint=100159, syscalls=1291722, swapins=0, swapouts=0
fault counts:
noram=0, noanon=0, pgwait=0, pgrele=0
ok relocks(total)=2288(2290), anget(retrys)=28503(0), amapcopy=4066
neighbor anon/obj pg=7094/55535, gets(lock/unlock)=20904/2290
cases: anon=25014, anoncow=3489, obj=16233, prcopy=4669, przero=37163
daemon and swap counts:
woke=0, revs=0, scans=0, obscans=0, anscans=0
busy=0, freed=0, reactivate=0, deactivate=0
pageouts=0, pending=0, nswget=0
nswapdev=0, nanon=247198, nanonneeded=247198 nfreeanon=211814
swpages=0, swpginuse=0, swpgonly=0 paging=0
db> trace /f
0x002249d8: at comintr+1a8(d0010800, d0010800, 210614, f0100000, d0084024,
d0085
000, 800, f9c65b14)
0x00224a18: at ext_intr_openpic+76c(1, 11, 0, 1032, 2, d0085fe0, 3c149, 0)
0x00224a58: at extint_call+0(9032, 5150, 0, 0, 0, 419ffcfc, 0, 4000d032)
0xd9c47d20: at Idle+10(418e4e0c, d9c47d30, d9c47d60, b43e8, 13faa298,
26c8a8, 26
6668, 0)
0xd9c47d30: at mi_switch+30c(13faa298, 26c8a8, 266668, 0, 7be48, 0,
3e4d25b0, 38
441)
0xd9c47d60: at ltsleep+488(22b5c0, 118, 201ef4, b, 0, 13faa298, 2686a8, 0)
0xd9c47da0: at sys_select+448(13faa298, d9c47f64, d9c47ee0, d9c47f64,
d9c47dc8, d
9c47e98, 8, cfffe928)
0xd9c47eb0: at trap+a60(d9c47f58, 13faa298, 10c00, 5, 0, 0, 5d, 14)
0xd9c47f50: user SC trap by 0x418e4e0c: srr1=0xd032
r1=0x7fffe930 cr=0x22000002 xer=0 ctr=0x419ffcfc
------_=_NextPart_001_01C2D466.849D01B0
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
charset=3Diso-8859-1">
<META NAME=3D"Generator" CONTENT=3D"MS Exchange Server version =
5.5.2653.12">
<TITLE>Kernel Deadlock under stress</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=3D2>Here is our env:</FONT>
</P>
<P><FONT SIZE=3D2>Netbsd 1.6 (updated with all the patches since Dec =
15)</FONT>
<BR><FONT SIZE=3D2>CPU: MAC PowerPC</FONT>
<BR><FONT SIZE=3D2>Root filesystem: NFS Root</FONT>
<BR><FONT SIZE=3D2>-----</FONT>
<BR><FONT SIZE=3D2>We have two process that communicate over SHARE =
memory and semaphore. Under stress the kernel seems to deadlock, =
triggered by these two processes. At this point the system just =
lock up and no response to ping or anything. I was able to break =
into ddb and grab the following traces.</FONT></P>
<P><FONT SIZE=3D2>I verified that in deed the kernel has plenty of =
memory left, and if my interpretation is correct is that the =
kernel is in "Idle" but in a deadlock state.</FONT></P>
<P><FONT SIZE=3D2>------</FONT>
<BR><FONT SIZE=3D2>Any help would be appreciated.</FONT>
</P>
<P><FONT SIZE=3D2>Thanks</FONT>
<BR><FONT SIZE=3D2>Huy</FONT>
</P>
<P><FONT SIZE=3D2>-----------------</FONT>
<BR><FONT SIZE=3D2>db> ps /n</FONT>
<BR><FONT =
SIZE=3D2> PID =
PPID =
PGRP UID S =
FLAGS =
COMMAND WAIT</FONT>
<BR><FONT SIZE=3D2>...</FONT>
<BR><FONT =
SIZE=3D2> 67 &=
nbsp; =
49 =
67 0 3 =
0x4004 =
proc1 vnlock</FONT>
<BR><FONT SIZE=3D2>...</FONT>
<BR><FONT =
SIZE=3D2> 65 &=
nbsp; =
49 =
65 0 3 =
0x4004 =
proc2 netio</FONT>
</P>
<P><FONT SIZE=3D2>db> trace /t 0t65</FONT>
<BR><FONT SIZE=3D2>trace: pid 65 at 0xd9c37720</FONT>
<BR><FONT SIZE=3D2>0xd9c37780: at cpu_switch+ac</FONT>
<BR><FONT SIZE=3D2>0xd9c37790: at mi_switch+30c</FONT>
<BR><FONT SIZE=3D2>0xd9c377c0: at ltsleep+488</FONT>
<BR><FONT SIZE=3D2>0xd9c37800: at sbwait+78</FONT>
<BR><FONT SIZE=3D2>0xd9c37820: at soreceive+5b4</FONT>
<BR><FONT SIZE=3D2>0xd9c378a0: at nfs_receive+7b4</FONT>
<BR><FONT SIZE=3D2>0xd9c37920: at nfs_reply+8c</FONT>
<BR><FONT SIZE=3D2>0xd9c37970: at nfs_request+71c</FONT>
<BR><FONT SIZE=3D2>0xd9c37a50: at nfs_getattr+7c4</FONT>
<BR><FONT SIZE=3D2>0xd9c37ac0: at VOP_GETATTR+80</FONT>
<BR><FONT SIZE=3D2>0xd9c37b10: at nfs_lookup+328</FONT>
<BR><FONT SIZE=3D2>0xd9c37c40: at VOP_LOOKUP+74</FONT>
<BR><FONT SIZE=3D2>0xd9c37c80: at lookup+624</FONT>
<BR><FONT SIZE=3D2>0xd9c37ce0: at namei+268</FONT>
<BR><FONT SIZE=3D2>0xd9c37d40: at vn_open+254</FONT>
<BR><FONT SIZE=3D2>0xd9c37df0: at sys_open+124</FONT>
<BR><FONT SIZE=3D2>0xd9c37eb0: at trap+a60</FONT>
<BR><FONT SIZE=3D2>0xd9c37f50: user SC trap by 0x419237d8: =
srr1=3D0xd032</FONT>
<BR><FONT =
SIZE=3D2> &nb=
sp; r1=3D0x7fffe750 cr=3D0x40000005 xer=3D0x20000000 ctr=3D0</FONT>
<BR><FONT SIZE=3D2>db> trace /t 0t67</FONT>
<BR><FONT SIZE=3D2>trace: pid 67 at 0xd9c3f910</FONT>
<BR><FONT SIZE=3D2>0xd9c3f970: at cpu_switch+ac</FONT>
<BR><FONT SIZE=3D2>0xd9c3f980: at mi_switch+30c</FONT>
<BR><FONT SIZE=3D2>0xd9c3f9b0: at ltsleep+488</FONT>
<BR><FONT SIZE=3D2>0xd9c3f9f0: at _lockmgr+12d8</FONT>
<BR><FONT SIZE=3D2>0xd9c3fa40: at genfs_lock+50</FONT>
<BR><FONT SIZE=3D2>0xd9c3fa60: at VOP_LOCK+68</FONT>
<BR><FONT SIZE=3D2>0xd9c3faa0: at vn_lock+cc</FONT>
<BR><FONT SIZE=3D2>0xd9c3fac0: at vget+224</FONT>
<BR><FONT SIZE=3D2>0xd9c3fae0: at cache_lookup+4c4</FONT>
<BR><FONT SIZE=3D2>0xd9c3fb10: at nfs_lookup+160</FONT>
<BR><FONT SIZE=3D2>0xd9c3fc40: at VOP_LOOKUP+74</FONT>
<BR><FONT SIZE=3D2>0xd9c3fc80: at lookup+624</FONT>
<BR><FONT SIZE=3D2>0xd9c3fce0: at namei+268</FONT>
<BR><FONT SIZE=3D2>0xd9c3fd40: at vn_open+254</FONT>
<BR><FONT SIZE=3D2>0xd9c3fdf0: at sys_open+124</FONT>
<BR><FONT SIZE=3D2>0xd9c3feb0: at trap+a60</FONT>
<BR><FONT SIZE=3D2>0xd9c3ff50: user SC trap by 0x419737d8: =
srr1=3D0xd032</FONT>
<BR><FONT =
SIZE=3D2> &nb=
sp; r1=3D0x7fffe7a0 cr=3D0x44000005 xer=3D0x20000000 ctr=3D0</FONT>
</P>
<P><FONT SIZE=3D2>db> show uvm</FONT>
<BR><FONT SIZE=3D2>Current UVM status:</FONT>
<BR><FONT SIZE=3D2> pagesize=3D4096 (0x1000), pagemask=3D0xfff, =
pageshift=3D12</FONT>
<BR><FONT SIZE=3D2> 247221 VM pages: 46496 active, 0 inactive, =
325 wired, 194909 free</FONT>
<BR><FONT SIZE=3D2> min 10% (25) anon, 10% (25) file, 5% =
(12) exec</FONT>
<BR><FONT SIZE=3D2> max 80% (204) anon, 50% (128) file, 30% =
(76) exec</FONT>
<BR><FONT SIZE=3D2> pages 38612 anon, 5310 file, 5882 =
exec</FONT>
<BR><FONT SIZE=3D2> freemin=3D64, free-target=3D85, =
inactive-target=3D0, wired-max=3D164814</FONT>
<BR><FONT SIZE=3D2> faults=3D88109, traps=3D1475879, =
intrs=3D195372, ctxswitch=3D111340</FONT>
<BR><FONT SIZE=3D2> softint=3D100159, syscalls=3D1291722, =
swapins=3D0, swapouts=3D0</FONT>
<BR><FONT SIZE=3D2> fault counts:</FONT>
<BR><FONT SIZE=3D2> noram=3D0, noanon=3D0, =
pgwait=3D0, pgrele=3D0</FONT>
<BR><FONT SIZE=3D2> ok relocks(total)=3D2288(2290), =
anget(retrys)=3D28503(0), amapcopy=3D4066</FONT>
<BR><FONT SIZE=3D2> neighbor anon/obj =
pg=3D7094/55535, gets(lock/unlock)=3D20904/2290</FONT>
<BR><FONT SIZE=3D2> cases: anon=3D25014, =
anoncow=3D3489, obj=3D16233, prcopy=3D4669, przero=3D37163</FONT>
<BR><FONT SIZE=3D2> daemon and swap counts:</FONT>
<BR><FONT SIZE=3D2> woke=3D0, revs=3D0, scans=3D0, =
obscans=3D0, anscans=3D0</FONT>
<BR><FONT SIZE=3D2> busy=3D0, freed=3D0, =
reactivate=3D0, deactivate=3D0</FONT>
<BR><FONT SIZE=3D2> pageouts=3D0, pending=3D0, =
nswget=3D0</FONT>
<BR><FONT SIZE=3D2> nswapdev=3D0, nanon=3D247198, =
nanonneeded=3D247198 nfreeanon=3D211814</FONT>
<BR><FONT SIZE=3D2> swpages=3D0, swpginuse=3D0, =
swpgonly=3D0 paging=3D0</FONT>
<BR><FONT SIZE=3D2>db> trace /f</FONT>
<BR><FONT SIZE=3D2>0x002249d8: at comintr+1a8(d0010800, d0010800, =
210614, f0100000, d0084024, d0085</FONT>
<BR><FONT SIZE=3D2>000, 800, f9c65b14)</FONT>
<BR><FONT SIZE=3D2>0x00224a18: at ext_intr_openpic+76c(1, 11, 0, 1032, =
2, d0085fe0, 3c149, 0)</FONT>
<BR><FONT SIZE=3D2>0x00224a58: at extint_call+0(9032, 5150, 0, 0, 0, =
419ffcfc, 0, 4000d032)</FONT>
<BR><FONT SIZE=3D2>0xd9c47d20: at Idle+10(418e4e0c, d9c47d30, d9c47d60, =
b43e8, 13faa298, 26c8a8, 26</FONT>
<BR><FONT SIZE=3D2>6668, 0)</FONT>
<BR><FONT SIZE=3D2>0xd9c47d30: at mi_switch+30c(13faa298, 26c8a8, =
266668, 0, 7be48, 0, 3e4d25b0, 38</FONT>
<BR><FONT SIZE=3D2>441)</FONT>
<BR><FONT SIZE=3D2>0xd9c47d60: at ltsleep+488(22b5c0, 118, 201ef4, b, =
0, 13faa298, 2686a8, 0)</FONT>
<BR><FONT SIZE=3D2>0xd9c47da0: at sys_select+448(13faa298, d9c47f64, =
d9c47ee0, d9c47f64, d9c47dc8, d</FONT>
<BR><FONT SIZE=3D2>9c47e98, 8, cfffe928)</FONT>
<BR><FONT SIZE=3D2>0xd9c47eb0: at trap+a60(d9c47f58, 13faa298, 10c00, =
5, 0, 0, 5d, 14)</FONT>
<BR><FONT SIZE=3D2>0xd9c47f50: user SC trap by 0x418e4e0c: =
srr1=3D0xd032</FONT>
<BR><FONT =
SIZE=3D2> &nb=
sp; r1=3D0x7fffe930 cr=3D0x22000002 xer=3D0 ctr=3D0x419ffcfc</FONT>
</P>
</BODY>
</HTML>
------_=_NextPart_001_01C2D466.849D01B0--