NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/57621: Updated 10.0_BETA macppc MP kernel prone to hangs
>Number: 57621
>Category: kern
>Synopsis: Updated 10.0_BETA macppc MP kernel prone to hangs
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Sep 19 21:40:01 +0000 2023
>Originator: he%NetBSD.org@localhost
>Release: NetBSD 10.0_BETA
>Organization:
I Try...
>Environment:
System: (it's still wedged, so can't do "uname -a", but it's a
relatively recent macppc 10.0_BETA kernel and user-land)
Architecture: powerpc
Machine: macppc
>Description:
It appears that a reasonably up-to-date 10.0_BETA kernel is
prone to hangs on this particular system. A -current kernel
built from sources dating June 8 2022 appears to be more
stable than a 10.0_BETA kernel from e.g. April 10 2023 or a
more recently updated 10.0_BETA kernel.
The annoying thing is that I can't get DDB to work on this
system.
The symptom is that simply doing a "build.sh -j 2 ... release"
job will with nearly probability 1 trigger a wedge where a few
processes get stuck in "tstile" state, and this disease then
spreads to other activity on the machine which grinds to a
wedge.
However, not *everything* comes to a halt, e.g. a remote xterm
with 'top' running may continue to run. Same for 'systat vm',
and also same with 'crash'.
A fairly typical 'top' output (when it's wedged, cut+pasted
just now, while the host wedged a couple of days ago):
load averages: 0.01, 0.02, 0.00; up 2+11:22:17 23:27:20
83 processes: 81 sleeping, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 911M Act, 560M Inact, 14M Wired, 23M Exec, 1368M File, 249M Free
Swap: 2000M Total, 40K Used, 2000M Free / Pools: 235M Used
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
0 root 43 0 0K 83M dbcool/1 11:17 1.90% 1.90% [system]
3361 he 85 0 9996K 1652K ttyraw/0 11:49 0.00% 0.00% systat
1242 he 85 0 17M 6308K select/1 5:14 0.00% 0.00% xterm
1851 he 43 0 10M 2180K CPU/1 2:28 0.00% 0.00% top
1255 he 85 0 24M 6696K poll/0 1:56 0.00% 0.00% sshd
1200 he 85 0 17M 3828K select/1 1:36 0.00% 0.00% xterm
1229 he 85 0 33M 2836K poll/0 1:26 0.00% 0.00% sshd
5367 root 43 0 8024K 2508K CPU/0 0:55 0.00% 0.00% top
1217 he 85 0 17M 7120K select/1 0:20 0.00% 0.00% xterm
1253 he 85 0 23M 1632K poll/1 0:17 0.00% 0.00% sshd
800 root 85 0 15M 14M pause/0 0:10 0.00% 0.00% ntpd
207 root 83 4 9564K 648K poll/0 0:08 0.00% 0.00% nbmake
13699 he 85 0 22M 6132K poll/0 0:06 0.00% 0.00% sshd
24480 root 85 0 6284K 320K kqueue/0 0:06 0.00% 0.00% tail
18249 he 117 0 18M 4912K tstile/0 0:05 0.00% 0.00% xterm
8682 he 85 0 23M 2252K poll/1 0:05 0.00% 0.00% sshd
20207 root 83 4 9404K 2724K poll/0 0:04 0.00% 0.00% nbmake
2217 root 83 4 9404K 2656K poll/0 0:04 0.00% 0.00% nbmake
27442 he 85 0 17M 3260K select/1 0:03 0.00% 0.00% xterm
24118 he 117 0 23M 1748K tstile/1 0:02 0.00% 0.00% sshd
13350 root 85 0 25M 3196K ttyraw/1 0:01 0.00% 0.00% crash
1085 root 117 0 19M 1192K tstile/0 0:00 0.00% 0.00% master
1075 root 117 0 7824K 1068K tstile/0 0:00 0.00% 0.00% cron
1254 postfix 117 0 20M 828K tstile/1 0:00 0.00% 0.00% qmgr
2221 root 108 4 4112K 420K tstile/1 0:00 0.00% 0.00% nbsed
1105 root 108 4 2016K 216K tstile/0 0:00 0.00% 0.00% sh
27957 root 108 4 1608K 176K tstile/0 0:00 0.00% 0.00% nbgrep
9720 root 85 0 22M 9396K poll/1 0:00 0.00% 0.00% sshd
12768 root 85 0 22M 9296K poll/0 0:00 0.00% 0.00% sshd
1218 root 85 0 22M 9188K poll/0 0:00 0.00% 0.00% sshd
1236 root 85 0 22M 8916K poll/1 0:00 0.00% 0.00% sshd
20726 he 85 0 17M 6580K select/1 0:00 0.00% 0.00% xterm
12983 he 85 0 23M 6496K poll/0 0:00 0.00% 0.00% sshd
...
A couple of things can be read out here:
* Several processes are stuck in 'tstile' wait channel (as
described above).
* There does not *appear* to be a drastic physical memory
shortfall, and swap is barely used.
The last point made me think that "this can't be VM related",
but it looks like I was wrong.
I also managed to collect "ps" and, I think, "bt/a" traces of
all the processes stuck in 'tstile' at the time of this run.
The output from those can be seen here:
crash> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
1105 1105 3 0 0 3347a100 sh tstile
27957 27957 3 0 0 231a5c40 nbgrep tstile
2221 2221 3 1 0 1d2b5380 nbsed tstile
15261 15261 3 0 180 534e3080 sh wait
23256 23256 3 0 180 61dbc3c0 nbmake poll
10572 10572 3 0 180 57015d00 sh wait
17709 17709 3 0 180 34e79a00 nbmake poll
22661 22661 3 0 180 107803c0 nbmake wait
10847 10847 3 1 0 695a10c0 pickup tstile
15598 15598 3 1 180 5c5666c0 sh wait
28627 28627 3 0 180 23276d00 nbmake poll
25303 25303 3 1 180 5c5660c0 sh wait
29847 29847 3 0 180 61dbc6c0 nbmake poll
5367 5367 3 0 180 34e79d00 top select
320 320 3 1 180 1d2b5c80 tcsh pause
2526 2526 3 0 180 15287d00 tcsh pause
13699 13699 3 1 180 15287700 sshd poll
9720 9720 3 1 180 1389ec80 sshd poll
21947 21947 3 1 180 1d7420c0 crash ttyraw
28553 28553 3 0 180 10778040 tcsh pause
12880 12880 3 1 180 21080c40 tcsh pause
1679 1679 3 0 180 5c5663c0 sshd poll
1191 1191 3 1 180 21125080 sshd poll
19615 19615 3 1 180 3343c040 sh wait
2217 2217 3 0 180 3347a400 nbmake poll
1584 1584 3 1 180 3347aa00 sh wait
20207 20207 3 0 180 10780cc0 nbmake poll
11398 11398 3 0 180 61dbccc0 sh wait
207 207 3 0 180 534e3380 nbmake poll
13301 13301 3 0 180 3347ad00 sh wait
24480 24480 3 0 180 205d9940 tail kqueue
23505 23505 3 0 180 22168cc0 tcsh pause
21730 21730 3 1 180 1d7426c0 tcsh ttyraw
20726 20726 3 1 180 5c5669c0 xterm select
7409 7409 3 0 180 34e79100 tcsh pause
12983 12983 3 0 180 23276100 sshd poll
24443 24443 3 1 180 534e3680 sshd poll
13350>13350 7 0 100 57015400 crash
17585 17585 3 0 180 3343c940 tcsh ttyraw
14989 14989 3 0 180 21080340 tcsh pause
27442 27442 3 0 1c0 61dbc0c0 xterm select
27802 27802 3 0 180 1d2b5080 tcsh pause
24118 24118 3 0 180 215b50c0 sshd poll
12768 12768 3 0 180 1d742cc0 sshd poll
8163 8163 3 0 180 231a5340 tcsh pause
10465 10465 3 0 180 21125c80 tcsh pause
18249 18249 3 1 1c0 21080040 xterm poll
10566 10566 3 0 180 21125380 tcsh pause
8682 8682 3 1 180 215b5cc0 sshd poll
10625 10625 3 1 180 34e79400 sshd poll
3361 3361 3 0 180 1d2b5980 systat ttyraw
1851 1851 3 0 180 3343cc40 top select
1103 1103 3 1 180 10778c40 tcsh pause
1378 1378 3 0 180 221689c0 tcsh pause
1200 1200 3 1 180 10338700 xterm select
1277 1277 3 0 180 10778940 tcsh pause
1319 1319 3 0 180 10338d00 tcsh pause
1229 1229 3 1 180 10778640 sshd poll
1241 1241 3 0 180 102ae9c0 tcsh pause
1243 1243 3 1 180 1389e980 sshd poll
1242 1242 3 0 180 107800c0 xterm select
1232 1232 3 0 180 10338a00 tcsh pause
1255 1255 3 0 180 21c79c80 sshd poll
1236 1236 3 1 180 221683c0 sshd poll
1217 1217 3 0 180 1389e380 xterm select
1223 1223 3 0 180 221686c0 tcsh pause
1253 1253 3 0 180 1389e080 sshd poll
1218 1218 3 0 180 221680c0 sshd poll
1084 1084 3 0 180 21c79980 getty ttyraw
1112 1112 3 0 180 21c79680 getty nanoslp
1066 1066 3 0 180 21c79380 getty nanoslp
1275 1275 3 0 180 21c79080 getty nanoslp
1086 1086 3 1 180 1022f100 getty ttyraw
1075 1075 3 0 40 205d9c40 cron tstile
1070 1070 3 1 180 205d9640 inetd kqueue
1254 1254 3 0 180 205d9340 qmgr kqueue
1085 1085 3 0 1000040 10778340 master tstile
825 825 3 0 180 15287100 sshd poll
708 708 3 0 180 107809c0 powerd kqueue
800 800 3 0 180 107806c0 ntpd pause
604 604 3 0 180 1389e680 syslogd kqueue
1 1 3 0 180 102ae0c0 init wait
0 198 3 1 200 102aecc0 physiod physiod
0 166 3 1 200 1022f700 pooldrain pooldrain
0 165 3 0 200 10338400 ioflush syncer
0 164 3 0 200 10338100 pgdaemon pgdaemon
0 126 3 1 200 7f87cc80 swwreboot swwreboot
0 124 3 1 200 102ae6c0 atapibus0 sccomp
0 122 3 0 200 102196c0 usb0 usbevt
0 121 3 0 200 102ae3c0 usb1 usbevt
0 120 3 0 200 1022f400 npfgc0 npfgcw
0 119 3 1 200 1028cc80 rt_free rt_free
0 118 3 1 200 1028c980 unpgc unpgc
0 117 3 0 200 1028c680 key_timehandler key_timehandler
0 116 3 1 200 1028c380 icmp6_wqinput/1 icmp6_wqinput
0 115 3 0 200 1028c080 icmp6_wqinput/0 icmp6_wqinput
0 114 3 0 200 1027ec40 nd6_timer nd6_timer
0 113 3 1 200 1027e940 carp6_wqinput/1 carp6_wqinput
0 112 3 0 200 1027e640 carp6_wqinput/0 carp6_wqinput
0 111 3 1 200 1027e340 carp_wqinput/1 carp_wqinput
0 110 3 0 200 1027e040 carp_wqinput/0 carp_wqinput
0 109 3 1 200 1022fd00 icmp_wqinput/1 icmp_wqinput
0 108 3 0 200 102199c0 icmp_wqinput/0 icmp_wqinput
0 107 3 0 200 1022fa00 rt_timer rt_timer
0 106 3 0 200 10219cc0 vmem_rehash vmem_rehash
0 97 3 1 200 102193c0 lmtemp0 lmtemp0
0 96 3 0 200 102190c0 dbcool0 dbcool0
0 30 3 0 200 7f87c980 entbutler entropy
0 29 3 1 380 7f87c680 fw0probe ieee1394
0 28 3 0 240 7f87c380 atabus2 atath
0 27 3 1 200 7f87c080 usbtask-dr usbtsk
0 26 3 0 200 7f901c40 usbtask-hc usbtsk
0 25 3 1 240 7f901940 atabus1 atath
0 24 3 1 200 7f901640 atabus0 atath
0 23 3 0 200 7f901340 pmu wait
0 22 3 1 200 7f901040 xcall/1 xcall
0 21 1 1 200 7f90ed00 softser/1
0 20 1 1 40200 7f90ea00 softclk/1
0 19 1 1 200 7f90e700 softbio/1
0 18 1 1 200 7f90e400 softnet/1
0 17 1 1 201 7f90e100 idle/1
0 16 3 0 200 7f91acc0 sysmon smtaskq
0 15 3 0 200 7f91a9c0 pmfsuspend pmfsuspend
0 14 3 0 200 7f91a6c0 pmfevent pmfevent
0 13 3 0 200 7f91a3c0 sopendfree sopendfr
0 12 3 0 200 7f91a0c0 ifwdog ifwdog
0 11 3 0 200 7fb26c80 iflnkst iflnkst
0 10 3 0 200 7fb26980 nfssilly nfssilly
0 9 3 1 240 7fb26680 vdrain vdrain
0 8 3 0 200 7fb26380 modunload mod_unld
0 7 3 0 200 7fb26080 xcall/0 xcall
0 6 1 0 200 7fb30c40 softser/0
0 5 1 0 40200 7fb30940 softclk/0
0 4 1 0 200 7fb30640 softbio/0
0 3 1 0 40200 7fb30340 softnet/0
0 > 2 1 0 201 7fb30040 idle/0
0 0 3 1 200 c32dc0 swapper uvm
crash>
crash> bt/a 1027ec40
trace: pid 0 lid 114 at 0x10267de0
0x10267e40: at cpu_switchto+0x28
0x10267e50: at mi_switch+0x140
0x10267e90: at sleepq_block+0xe0
0x10267eb0: at cv_wait+0xc0
0x10267ee0: at workqueue_worker+0x12c
0x10267f20: at cpu_lwp_bootstrap+0xc
0x10267fe8: kernel trap 0 by __end+0xf5fabf8: srr1=0x111f8340
r1=0x2 cr=0x5 xer=0x40000 ctr=0xfffffff0
crash> bt/a 1d2b5380
trace: pid 2221 lid 2221 at 0x4da13bc0
0x4da13c20: at cpu_switchto+0x28
0x4da13c30: at mi_switch+0x140
0x4da13c70: at sleepq_block+0xe0
0x4da13c90: at turnstile_block+0x284
0x4da13cf0: at rw_enter+0x124
0x4da13d30: at uvm_fault_internal+0xa3c
0x4da13e70: at trap+0x674
0x4da13f20: user 0xfded1660: srr1=0xd032
r1=0xffffe1e0 cr=0x84002008 xer=0x20000000 ctr=0xfded1650
crash> bt/a 231a5c40
trace: pid 27957 lid 27957 at 0x21583bc0
0x21583c20: at cpu_switchto+0x28
0x21583c30: at mi_switch+0x140
0x21583c70: at sleepq_block+0xe0
0x21583c90: at turnstile_block+0x284
0x21583cf0: at rw_enter+0x124
0x21583d30: at uvm_fault_internal+0xa3c
0x21583e70: at trap+0x674
0x21583f20: user 0xfded3d80: srr1=0xd032
r1=0xffffda10 cr=0x84008004 xer=0x20000000 ctr=0
crash> bt/a 695a10c0
trace: pid 10847 lid 10847 at 0x5523ec10
0x5523ec70: at cpu_switchto+0x28
0x5523ec80: at mi_switch+0x140
0x5523ecc0: at sleepq_block+0xe0
0x5523ece0: at turnstile_block+0x284
0x5523ed40: at rw_enter+0x124
0x5523ed80: at uvm_unmap_remove+0x158
0x5523edc0: at uvmspace_free+0xe0
0x5523ede0: at exit1+0x1a8
0x5523eea0: at sys_exit+0x44
0x5523eec0: at syscall+0x294
0x5523ef20: user SC trap #1 by 0xfd201548: srr1=0xd032
r1=0xffffe590 cr=0x24442442 xer=0x20000000 ctr=0xfd201540
crash>
crash> bt/a 10778340
trace: pid 1085 lid 1085 at 0x107abbc0
0x107abc20: at cpu_switchto+0x28
0x107abc30: at mi_switch+0x140
0x107abc70: at sleepq_block+0xe0
0x107abc90: at turnstile_block+0x284
0x107abcf0: at rw_enter+0x124
0x107abd30: at uvm_fault_internal+0xa3c
0x107abe70: at trap+0x674
0x107abf20: user 0x18153e0: srr1=0xd032
r1=0xffffd9b0 cr=0x28022444 xer=0 ctr=0xfd2016b8
crash>
crash> bt/a 205d9c40
trace: pid 1075 lid 1075 at 0x17513ca0
0x17513d00: at cpu_switchto+0x28
0x17513d10: at mi_switch+0x140
0x17513d50: at sleepq_block+0xe0
0x17513d70: at turnstile_block+0x284
0x17513dd0: at rw_enter+0x124
0x17513e10: at uvmspace_fork+0x35c
0x17513e40: at uvm_proc_fork+0x34
0x17513e50: at fork1+0x334
0x17513ec0: at syscall+0x294
0x17513f20: user SC trap #2 by 0xfdede44c: srr1=0xd032
r1=0xffffe8c0 cr=0x22000422 xer=0x20000000 ctr=0xfded7ba4
crash>
crash> bt/a 3347a100
trace: pid 1105 lid 1105 at 0x1eb23bc0
0x1eb23c20: at cpu_switchto+0x28
0x1eb23c30: at mi_switch+0x140
0x1eb23c70: at sleepq_block+0xe0
0x1eb23c90: at turnstile_block+0x284
0x1eb23cf0: at rw_enter+0x124
0x1eb23d30: at uvm_fault_internal+0xa3c
0x1eb23e70: at trap+0x674
0x1eb23f20: user 0xfded3d80: srr1=0xd032
r1=0xffffda30 cr=0x84008004 xer=0x20000000 ctr=0
crash>
This, somewhat unexpectedly, points to "something" in the UVM
subsystem which might be amiss(?), but figuring out what is
above my pay grade.
Guidance for next steps in narrowing down the actual root
cause gratefully accepted.
>How-To-Repeat:
Update a dual-CPU macppc G4 system to recent 10.0_BETA code,
and do a "build.sh -j 2 ... release" run. The probability of
it getting stuck sometime along the build is nearly 1.
>Fix:
Sorry, don't know.
>Unformatted:
Home |
Main Index |
Thread Index |
Old Index