Current-Users archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: random lockups (now suspecting zfs)
Stephen Borrill <netbsd%precedence.co.uk@localhost> writes:
> On Sat, 4 Nov 2023, Simon Burge wrote:
>> Greg Troxel wrote:
>>
>>> So to me this feels like a locking botch in a rare path in zfs.
>>
>> This appears to be the case. Chuck Silvers has some understanding of
>> the problem and I'm helping test, but at this stage there isn't a fix
>> available. :/
>
> It's interesting that you see the lockups during pkgsrc builds, i.e. a
> period where there is lots of file creation. We use zfs on backup
> systems that pull in data with rsync. During the initial runs (where
> every file is new) we usually get a couple of lockups, but during day
> to day operation (few changes) it is reliable. These are on physical
> and virtual machines running NetBSD 9 with the rule of thumb of 1GB
> RAM per TB of storage obeyed, but no patches besides setting MAXPHYS
> in the module to 32k for Xen.
I just had another problem, on the non-xen 32GB machine (which has 3.5T
in the pool, only half full).
The machine wasn't really doing much; X running with xfce, a few xterms,
ssh client, pidgin, and idling firefox with I think 24 tabs.
I found it mostly normal and was using an ssh session, and then switched
to the firefox virtual desktop which failed to redraw. I tried to kill
firefox (because firefox hanging is not so strange :-() and found that a
few of the tabs appeared to be stuck in flt_noram and zio_buf. I think
there might have been a different wchan earlier that was zfs but not
zio_buf.
I think it got in this state due to firefox leaking memory (in SIZE but
not RES?).
(So it might be a missing wakeup on flt_noram, but lock not released
seems plausible also. Totally guessing here.)
(As I was composing this message (in tmux on another machine), the
firefox lockup deteriorated to more things hanging and then a total
lockup. I was unable to ctrl-alt-f1 to get back to the text console.
It is responding to mdns queries and pings and sshd answers but I see
"local version string" and the "remote protocol" line does not appear.
I should try LOCKDEBUG on the package building box (where if it doesn't
work right that's much more ok!).
10853 129 9817 0 85 0 2762264 180704 uvnfp2 DEl ttyp5 7:12.20 (firefox)
10853 994 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 1867 9817 0 85 0 3423848 723944 uvnfp2 DEl ttyp5 146:52.32 (firefox)
10853 7407 9817 0 85 0 20169184 355160 flt_nora DEl ttyp5 63:48.76 (firefox)
10853 7630 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 8451 9817 0 85 0 2712376 126588 uvnfp2 DEl ttyp5 7:09.93 (firefox)
10853 8504 9817 0 85 0 2744608 143008 uvnfp2 DEl ttyp5 9:56.45 (firefox)
10853 9817 1 21 117 0 12939188 948252 zio_buf_ DEl ttyp5 303:41.53 (firefox)
10853 11066 9817 0 85 0 2821832 225664 &db->db_ DEl ttyp5 1:34.01 (firefox)
10853 11769 9817 0 85 0 2849780 232172 uvnfp2 DEl ttyp5 9:19.27 (firefox)
10853 12055 9817 0 85 0 2832852 144304 uvnfp2 DEl ttyp5 8:49.22 (firefox)
10853 13075 9817 0 85 0 2782516 193652 plpg DEl ttyp5 9:00.21 (firefox)
10853 15399 9817 0 85 0 2822236 249496 uvnfp2 DEl ttyp5 10:12.41 (firefox)
10853 15991 9817 0 85 0 2775316 187104 uvnfp2 DEl ttyp5 7:13.63 (firefox)
10853 16033 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 16877 9817 0 85 0 2731156 148896 uvnfp2 DEl ttyp5 1:59.22 (firefox)
10853 17275 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 19768 9817 0 85 0 2760188 152880 uvnfp2 DEl ttyp5 7:11.17 (firefox)
10853 21618 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 24342 9817 0 85 0 2737588 148452 uvnfp2 DEl ttyp5 11:51.61 (firefox)
10853 24956 9817 0 85 0 2981764 336852 uvnfp2 DEl ttyp5 20:20.13 (firefox)
10853 26368 9817 0 85 0 3164560 240992 uvnfp2 DEl ttyp5 19:28.72 (firefox)
10853 26981 9817 1123 85 0 3659088 770432 flt_nora DEl ttyp5 84:09.22 (firefox)
10853 27139 9817 0 0 0 0 0 - Z ttyp5 0:00.00 (firefox)
10853 29076 9817 2270 85 0 2975552 261064 flt_nora DEl ttyp5 88:44.15 (firefox)
top says
Memory: 14G Act, 6989M Inact, 88M Wired, 549M Exec, 13G File, 228M Free
Swap: 40G Total, 348M Used, 40G Free / Pools: 11G Used
so it did get into paging
vmstat -s:
4096 bytes per page
16 page colors
8079588 pages managed
58419 pages free
3733123 pages active
1789074 pages inactive
1 pages paging
22427 pages wired
1 reserve pagedaemon pages
40 reserve kernel pages
252503 boot kernel pages
2817259 kernel pool pages
2027769 anonymous pages
3376469 cached file pages
140490 cached executable pages
2048 minimum free pages
2730 target free pages
2693196 maximum wired pages
1 swap devices
10485759 swap pages
89099 swap pages in use
534237 swap allocations
6412584353 total faults taken
6413831611 traps
78980016 device interrupts
1337982749 CPU context switches
312160043 software interrupts
11693270083 system calls
519672 pagein requests
84753 pageout requests
0 pages swapped in
1355436 pages swapped out
11042006 forks total
7107987 forks blocked parent
7107987 forks shared address space with parent
5114776798 pagealloc desired color avail
328102950 pagealloc desired color not avail
4442734240 pagealloc local cpu avail
1000145508 pagealloc local cpu not avail
15518 faults with no memory
0 faults with no anons
1780 faults had to wait on pages
0 faults found released page
18115368 faults relock (18042145 ok)
301349249 anon page faults
510693 anon retry faults
588841483 amap copy faults
238859694 neighbour anon page faults
6051925265 neighbour object page faults
1842970725 locked pager get faults
17593932 unlocked pager get faults
193768531 anon faults
107569755 anon copy on write faults
1381042165 object faults
461851046 promote copy faults
4244253805 promote zero fill faults
4295634 faults upgraded lock
15025 faults couldn't upgrade lock
5042 times daemon wokeup
2631 revolutions of the clock hand
29898211 pages freed by daemon
45830406 pages scanned by daemon
992050 anonymous pages scanned by daemon
28906451 object pages scanned by daemon
321731 pages reactivated
0 pages found busy by daemon
1270683 total pending pageouts
49696575 pages deactivated
272335878 per-cpu stats synced
686222 anon pages possibly dirty
1341547 anon pages dirty
0 anon pages clean
70660 file pages possibly dirty
56 file pages dirty
3446243 file pages clean
20979779248 total name lookups
19150722941 good hits
1317343206 negative hits
10803620 bad hits
0 false hits
500909481 miss
0 too long
968276 pass2 hits
1088078 2passes
17136909 reverse hits
10934 reverse miss
0 access denied
cache hits (91% pos + 6% neg) system 0% per-process
deletions 0%, falsehits 0%, toolong 0%
Home |
Main Index |
Thread Index |
Old Index