NetBSD-Bugs archive
Re: kern/39242: NetBSD 4.0 will start to busy-loop and hang on machines with more than 4 GB memory
Hi - once again ...
I've now removed 4 GB of memory and the SCSI controller seems to work, but ...
kernel: protection fault trap, code=0
Stopped in pid 28.1 (aiodoned) at netbsd:uvm_tree_RB_REMOVE+0x50:
movq %r14,0x10(%r15)
db{0}> trace
uvm_tree_RB_REMOVE() at netbsd:uvm_tree_RB_REMOVE+0x50
uvm_rb_remove() at netbsd:uvm_rb_remove+0x1c
uvm_unmap_remove() at netbsd:uvm_unmap_remove+0x179
uvm_pagermapout() at netbsd:uvm_pagermapout+0x110
uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xc4
uvm_aiodone_daemon() at netbsd:uvm_aiodone_daemon+0xd2
db{0}>
At the time of the crash, top reports:
load averages: 1.03, 0.47, 0.19    up 0 days, 0:38    12:27:08
67 processes: 1 runnable, 64 sleeping, 2 on processor
CPU0 states: 0.0% user, 0.0% nice, 11.9% system, 2.5% interrupt, 85.6% idle
CPU1 states: 0.0% user, 0.0% nice, 3.0% system, 0.0% interrupt, 97.0% idle
Memory: 2565M Act, 1253M Inact, 8924K Wired, 6296K Exec, 1316M File, 228K Free
Swap: 24G Total, 9435M Used, 15G Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
10722 root -5 0 868K 1792K biowai/0 0:13 78.93% 7.52% tar
23 root -6 0 0K 28M RUN/1 0:10 5.37% 5.37% [raidio0]
26 root -18 0 0K 28M pgdaem/0 0:04 1.71% 1.71% [pagedaemon]
28 root -18 0 0K 28M aiodon/0 0:04 1.17% 1.17% [aiodoned]
21 root -6 0 0K 28M raidio/1 0:05 1.07% 1.07% [raidio1]
9957 root 28 0 120K 820K CPU/1 0:00 0.00% 0.98% cp
22 root -6 0 0K 28M rfwcon/0 0:01 0.68% 0.68% [raid0]
20 root -6 0 0K 28M rfwcon/0 0:00 0.05% 0.05% [raid1]
9 root -6 0 0K 28M sccomp/1 0:02 0.00% 0.00% [scsibus0]
198 ncadmin 28 0 572K 1588K CPU/0 0:00 0.00% 0.00% top
27 root 18 0 0K 28M syncer/0 0:00 0.00% 0.00% [ioflush]
816 root 18 0 988K 4188K pause/0 0:00 0.00% 0.00% ntpd
9303 wgstuken 18 0 260K 1168K pause/1 0:00 0.00% 0.00% <csh>
10024 root 18 0 236K 1144K pause/1 0:00 0.00% 0.00% <csh>
1515 root 18 0 240K 1088K pause/1 0:00 0.00% 0.00% <csh>
18 root 14 0 0K 28M crypto/0 0:00 0.00% 0.00% [cryptoret]
13 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb7]
12 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb6]
11 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb5]
10 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb4]
571 root 10 0 0K 28M nfsidl/0 0:00 0.00% 0.00% [nfsio]
570 root 10 0 0K 28M nfsidl/0 0:00 0.00% 0.00% [nfsio]
545 root 10 0 0K 28M nfsidl/0 0:00 0.00% 0.00% [nfsio]
499 root 10 0 0K 28M nfsidl/0 0:00 0.00% 0.00% [nfsio]
3 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb0]
4 root 10 0 0K 28M usbtsk/0 0:00 0.00% 0.00% [usbtask-hc]
5 root 10 0 0K 28M usbtsk/0 0:00 0.00% 0.00% [usbtask-dr]
6 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb1]
7 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb2]
8 root 10 0 0K 28M usbevt/0 0:00 0.00% 0.00% [usb3]
10303 root 10 0 484K 2768K wait/0 0:00 0.00% 0.00% <login>
195 root 10 0 620K 1944K wait/1 0:00 0.00% 0.00% <login>
1566 root 10 0 476K 1936K wait/0 0:00 0.00% 0.00% <login>
196 ncadmin 10 0 284K 1048K wait/0 0:00 0.00% 0.00% <sh>
1443 root 10 0 268K 880K nanosl/1 0:00 0.00% 0.00% <cron>
1 root 10 0 100K 812K wait/0 0:00 0.00% 0.00% <init>
557 root 2 0 1060K 3988K select/0 0:00 0.00% 0.00% <amd>
10740 root 2 0 492K 1536K poll/0 0:00 0.00% 0.00% <rlogind>
872 root 2 0 344K 1456K select/0 0:00 0.00% 0.00% <sshd>
1385 postfix 2 0 628K 1216K kqread/0 0:00 0.00% 0.00% <qmgr>
There seems to be a bigger problem in UVM, as I suspected before!
db{0}> show uvmexp
Current UVM status:
pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
1018418 VM pages: 656891 active, 320807 inactive, 2230 wired, 5 free
pages 634067 anon, 344367 file, 1574 exec
freemin=64, free-target=85, wired-max=339472
faults=3609406, traps=3722187, intrs=7567575, ctxswitch=9890365
softint=380503, syscalls=14268860, swapins=245, swapouts=270
fault counts:
noram=24, noanon=0, pgwait=24, pgrele=0
ok relocks(total)=1148(1150), anget(retrys)=114762(394), amapcopy=55274
neighbor anon/obj pg=115362/837538, gets(lock/unlock)=198035/756
cases: anon=77487, anoncow=37248, obj=164994, prcopy=33039, przero=98761
daemon and swap counts:
woke=15991, revs=10979, scans=2430998, obscans=2420929, anscans=2524
busy=0, freed=2423039, reactivate=730, deactivate=2791838
pageouts=156825, pending=2266288, nswget=297605
nswapdev=1, swpgavail=6291455
swpages=6291455, swpginuse=2422999, swpgonly=2125395, paging=78
db{0}> ps
PID PPID PGRP UID S FLAGS LWPS COMMAND WAIT
9957 1515 9957 0 2 0x4002 1 cp uvn_fp1
10722 10024 10722 0 2 0x4002 1 tar
10024 9303 10024 0 2 0x4002 1 csh pause
9303 10303 9303 1002 2 0x4002 1 csh pause
10303 10740 10303 0 2 0x4103 1 login wait
10740 1262 1262 0 2 0x4100 1 rlogind poll
198 196 198 500 2 0x4002 1 top poll
196 195 196 500 2 0x4002 1 sh wait
195 194 195 0 2 0x4103 1 login wait
194 1262 1262 0 2 0x4100 1 rlogind poll
1515 1566 1515 0 2 0x4002 1 csh pause
1566 1 1566 0 2 0x4102 1 login wait
1443 1 1443 0 2 0 1 cron nanosle
1262 1 1262 0 2 0 1 inetd kqread
1385 1263 1263 12 2 0x4108 1 qmgr kqread
950 1263 1263 12 2 0x4108 1 pickup kqread
1263 1 1263 0 2 0x4108 1 master kqread
872 1 872 0 2 0 1 sshd select
816 1 816 0 2 0 1 ntpd pause
757 1 757 0 2 0 1 lpd poll
631 1 631 0 2 0 1 rpc.lockd select
672 1 672 0 2 0xa0008 1 rpc.statd select
682 614 614 0 2 0 1 nfsd nfsd
677 614 614 0 2 0 1 nfsd nfsd
637 614 614 0 2 0 1 nfsd nfsd
671 614 614 0 2 0 1 nfsd nfsd
614 1 614 0 2 0 1 nfsd poll
615 1 615 0 2 0 1 mountd select
571 0 0 0 2 0x20200 1 nfsio nfsidl
570 0 0 0 2 0x20200 1 nfsio nfsidl
545 0 0 0 2 0x20200 1 nfsio nfsidl
499 0 0 0 2 0x20200 1 nfsio nfsidl
557 1 557 0 2 0 1 amd select
505 1 505 0 2 0 1 ypbind select
497 1 497 0 2 0 1 rpcbind poll
574 1 574 0 2 0 1 syslogd kqread
248 1 248 0 2 0 1 routed select
108 0 0 0 2 0x20200 1 physiod physiod
>28 0 0 0 2 0x20200 1 aiodoned
27 0 0 0 2 0x20200 1 ioflush syncer
26 0 0 0 2 0x20200 1 pagedaemon pgdaemo
25 0 0 0 2 0x20200 1 raidio2 raidiow
24 0 0 0 2 0x20200 1 raid2 rfwcond
23 0 0 0 2 0x20200 1 raidio0
22 0 0 0 2 0x20200 1 raid0
21 0 0 0 2 0x20200 1 raidio1
20 0 0 0 2 0x20200 1 raid1 rfwcond
19 0 0 0 2 0x20200 1 atapibus0 sccomp
18 0 0 0 2 0x20200 1 cryptoret crypto_
17 0 0 0 2 0x20200 1 atabus3 atath
16 0 0 0 2 0x20200 1 atabus2 atath
15 0 0 0 2 0x20200 1 atabus1 atath
14 0 0 0 2 0x20200 1 atabus0 atath
13 0 0 0 2 0x20200 1 usb7 usbevt
12 0 0 0 2 0x20200 1 usb6 usbevt
11 0 0 0 2 0x20200 1 usb5 usbevt
10 0 0 0 2 0x20200 1 usb4 usbevt
9 0 0 0 2 0x20200 1 scsibus0 sccomp
8 0 0 0 2 0x20200 1 usb3 usbevt
7 0 0 0 2 0x20200 1 usb2 usbevt
6 0 0 0 2 0x20200 1 usb1 usbevt
5 0 0 0 2 0x20200 1 usbtask-dr usbtsk
4 0 0 0 2 0x20200 1 usbtask-hc usbtsk
3 0 0 0 2 0x20200 1 usb0 usbevt
2 0 0 0 2 0x20200 1 sysmon smtaskq
1 0 1 0 2 0x4001 1 init wait
0 -1 0 0 2 0x20200 1 swapper schedul
db{0}>
At the time of the crash "tar" is extracting an archive into one
filesystem and "cp" is copying a large file into another filesystem -
both on the same raid-device (raid1 on SATA) - but that has caused no
problem in the past.
The source of both commands is in /tmp (tmpfs) - ca. 11 GB resides in /tmp.
Is tmpfs known to be stable? Or may the problem be there?
Any Idea how to debug further on?
I will see if I'm able to replace tmpfs by a real filesystem and try
again ...
By the way: I've noticed another problem in uvm_pglistalloc_simple()
while thinking about the strategy used there.
If several processes together try to allocate more memory than the
system has in total, the strategy used there may deadlock the whole system.
If, for example, 30% of memory has already been granted to each of three
processes that each request 50% of the whole memory, it will be impossible
for any of them to complete, because the required memory is held by the
other waiting processes and cannot be stolen back from them (there is no
way to manipulate the local variable "num" from outside ...). It would be
possible to satisfy the request of the first one, and after it has freed
its memory again, the next one, and so on. Therefore a way is needed to
"steal" the memory back from waiting processes.
This would be a very rare and strange situation, but it is not
detected by the system at the moment. I'm not sure whether there is an
inexpensive and easy way to detect such a situation, but this "possible"
problem should at least be documented in a comment in the source file.
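To make the scenario concrete, here is a minimal userland simulation of
the allocation strategy as I understand it (my own sketch with made-up
names such as TOTAL_PAGES and held[] - not the actual
uvm_pglistalloc_simple() code): each requester keeps the pages it has
already been granted while sleeping for the rest, so once the free pool
is exhausted no waiter can ever finish.

/*
 * Simulated deadlock: three requesters each want 50% of memory and
 * have each already been granted 30%, as in the example above.
 * This is an illustrative sketch, not kernel code.
 */
#include <stdio.h>

#define TOTAL_PAGES 100
#define NPROC 3

int
main(void)
{
	int want = TOTAL_PAGES / 2;		/* each requests 50% */
	int held[NPROC] = { 30, 30, 30 };	/* 30% already granted each */
	int free_pages = TOTAL_PAGES - 3 * 30;	/* only 10% still free */
	int progress = 1;

	while (progress) {
		progress = 0;
		for (int i = 0; i < NPROC; i++) {
			/* grant whatever is free, up to what is still missing */
			int grant = want - held[i];
			if (grant > free_pages)
				grant = free_pages;
			if (grant > 0) {
				held[i] += grant;
				free_pages -= grant;
				progress = 1;
			}
			if (held[i] >= want) {
				printf("process %d completed\n", i);
				return 0;
			}
		}
	}

	/*
	 * No process can ever complete: all memory is held by sleeping
	 * requesters and there is no way to take it back from them.
	 */
	printf("deadlock: free=%d, held=%d/%d/%d, nobody reaches %d pages\n",
	    free_pages, held[0], held[1], held[2], want);
	return 1;
}

Serving one request completely before starting the next (or being able to
take pages back from sleeping waiters) would let the first process finish,
free its pages again, and unblock the others.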
best regards
W. Stukenbrock
PS. I've removed the previous mail content - it is in the GNATS system
and can be reviewed there.