NetBSD-Users archive
continued zfs-related lockups
I have been having continued zfs-related lockups on two systems and am
posting some anecdata/comments. I am building a LOCKDEBUG kernel to see
if that changes anything. Both systems are up-to-date netbsd-10.
System 1 is bare metal, 32G RAM.
System 2 is Xen, 4000M RAM in the dom0. The issues described are
provoked in the dom0, without the domUs doing much.
I am carrying a patch to reduce the ARC size based on memory size.
Something like this should be committed, because the current approach
just uses too much memory. The details are paged out of my head, but
here are my boot printfs on system 1:
ARCI 002 arc_abs_min 16777216
ARCI 002 arc_c_min 1067485440
ARCI 005 arc_c_max 4269941760
ARCI 010 arc_c_min 1067485440
ARCI 010 arc_p 2134970880
ARCI 010 arc_c 4269941760
ARCI 010 arc_c_max 4269941760
ARCI 011 arc_meta_limit 1067485440
Basically you can see about 4G of data and 1G for meta.
On system 2:
ARCI 002 arc_abs_min 16777216
ARCI 002 arc_c_min 131072000
ARCI 005 arc_c_max 524288000
ARCI 010 arc_c_min 131072000
ARCI 010 arc_p 262144000
ARCI 010 arc_c 524288000
ARCI 010 arc_c_max 524288000
ARCI 011 arc_meta_limit 131072000
Basically you can see about 500MB for arc data and 125MB for meta.
These values should not mess up a 32G or a 4000 MB system. One can of
course argue about whether they should be somewhat bigger or somewhat
smaller, and more importantly about how memory pressure from other
subsystems should interact with the ARC. IMHO the ARC should be
considered part of the file cache.
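For what it's worth, those numbers look consistent with sizing off
physmem at roughly 1/8 for arc_c_max, 1/16 for arc_p, and 1/32 for
arc_c_min and arc_meta_limit. A quick check against the system 2 values
(this is just my arithmetic from the printfs, not the patch itself):

physmem=$((4000 * 1024 * 1024))   # the dom0's 4000 MB
echo $((physmem / 8))             # 524288000  = arc_c_max
echo $((physmem / 16))            # 262144000  = arc_p
echo $((physmem / 32))            # 131072000  = arc_c_min and arc_meta_limit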
Also, due to past experience I have the following in sysctl.conf. As I
recall, processes were getting paged out to make room for the file
cache, which resulted in performance I didn't like.
# \todo Reconsider and document
vm.filemin=5
vm.filemax=10
vm.anonmin=5
vm.anonmax=80
vm.execmin=5
vm.execmax=50
vm.bufcache=5
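In case anyone wants to compare, these can be inspected (or tried
temporarily) with sysctl at runtime; the value below is just one of the
settings from my sysctl.conf above:

sysctl vm.filemin vm.filemax vm.anonmin vm.anonmax vm.execmin vm.execmax vm.bufcache
sysctl -w vm.filemax=10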
* system 1 (32G bare metal)
The problem smells like "processes running out of RAM and asking for
more while the system is doing lots of zfs operations": something like
pkg_rr or build.sh running while flipping tabs in firefox. The lockup
starts gradually and gets worse. If I don't leave firefox running
overnight, and especially if I don't leave piggy tabs open, crashes are
much less frequent.
I managed to catch it early, flip out of X to the text console, and
then drop into ddb. I am still learning how to interpret things, but:

There were several processes in tstile. The underlying locks seem to
be:
- zfs:buf_hash_table+0x1300
- netbsd:vfs_suspend_lock (from a rename system call IIRC)

Some of the wchans are flt_noram5; I realize that is normal.

Several pools were very big:
- zfs_znode_cache: size 240 npages 822221
- zio_buf_512: size 512 npages 240926 nitems 735236 nout 1187132
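Back-of-the-envelope, assuming 4 KiB pool pages for these small-item
pools, those npage counts come to about 4 GB between just these two:

echo $((822221 * 4096 / 1048576))   # zfs_znode_cache: ~3211 MiB
echo $((240926 * 4096 / 1048576))   # zio_buf_512: ~941 MiB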
I interpret this as:
- zfs has allocated too much RAM
- something did an fsop which requires vfs suspend
- something else tried to operate during that suspend, perhaps
  deadlocking with RAM acquisition
The mystery is why others aren't seeing this.
* system 2 (4000 MB dom0)
This machine is used for building packages, in the dom0 and in 4 domUs,
and holds distfiles, pkgsrc trees for about a year of quarterly branches
plus current and wip, binary packages, and ccache dirs.
The real issue seems to be the ccache dirs. In total there are 16G of
cache files across 5 cpu/os/version tuples (the 4 this machine uses,
plus netbsd-10-aarch64, used over NFS from an RPI4). There are about
1.5M files. Just running find over that pushes total pool allocation
from about 657,000K to about 3,635,000K:
$ vmstat -m|tail -5; find /tank0/ccache -type f|wc -l; vmstat -m|tail -5
[white space adjusted to make this easier to follow]
zio_link_cache 48 3695 0 0 44 0 44 44 0 inf 0
Totals 763425 0 83834 82688 0 82688
In use 636990K, total allocated 657236K; utilization 96.9%
1510577
zio_link_cache 48 3695 0 0 44 0 44 44 0 inf 0
Totals 5223892 0 365030 539962 0 539962
In use 3539172K, total allocated 3635060K; utilization 97.4%
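Dividing the growth in total allocation by the number of files found,
that is roughly 2K of pool memory per file:

echo $(((3635060 - 657236) * 1024 / 1510577))   # ~2018 bytes per file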
On this system, the symptom is that the system just stops responding. I
am running
/sbin/wdogctl -x -p 367 tco0
and it recovers automatically.
It has been crashing on /etc/daily. I am able to provoke the crash with
"find /big/place -type f | wc -l". It feels worse lately, and I wondered
about hardware. I also rolled back from a 7/1 kernel to one from 6/26.
It could be that my data has simply gotten bigger.
I find that total pool usage as reported by vmstat -m goes up as I run
find, to unreasonable levels. As an example, after running find over
ccache and pkgsrc-current, I see
In use 1645282K, total allocated 1751476K; utilization 93.9%
which is way too much for a 4000M machine.
It seems like if I do
find /place1
wait many minutes
find /place2
then the second find does not vastly increase pools. But if I don't
wait, it does. I have seen the pool total (as reported by top) as high
as 3465M.
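A sketch of that experiment (/place1 and /place2 stand for any two
large zfs trees, as above; the sleep length is a guess at "many
minutes"):

find /place1 -type f > /dev/null
vmstat -m | tail -1
sleep 1800
find /place2 -type f > /dev/null
vmstat -m | tail -1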
Eventually, when pushed too far, the machine locks up; entering ddb, I
saw nothing in tstile.
Stay tuned for LOCKDEBUG info on system 2. I can crash that one without
causing myself extra effort.
Overall, it feels like zfs has some kind of cache that is not the ARC,
and that it has unreasonable limits.
I realize some of you think 4000M is low memory, but the system should
be stable, if slow, anyway. And 32G is really a healthy amount of RAM
these days.
So: if you have a system with a lot of files in zfs, and you don't mind
crashing it, running find would be interesting. Even if it doesn't
crash, the vmstat -m output would be interesting.
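Something like this would capture the before/after totals (the path is
a placeholder for your own zfs filesystem; the last line of vmstat -m is
the "In use / total allocated" summary):

vmstat -m | tail -1
find /your/zfs/filesystem -type f | wc -l
vmstat -m | tail -1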
Here are the pools on the 32G system, up 5 days:
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
zio_buf_2048 2048 155225 0 153857 60590 59530 1060 17041 0 inf 1
arc_buf_t 32 960575 0 919523 1417 147 1270 1417 0 inf 0
uarea 24576 3507 0 2694 2148 866 1282 1282 0 inf 469
i915_vma 704 14335 0 6162 2362 727 1635 1635 0 inf 0
zio_buf_4096 4096 33584 0 31722 21870 20008 1862 5335 0 inf 0
range_seg_cache 64 735359 0 679561 3968 2062 1906 2985 0 inf 0
drm_i915_gem_ob 768 15917 0 6368 2656 746 1910 1910 0 inf 0
kmem-00256 256 339404 0 309748 18365 16446 1919 16939 0 inf 65
kmem-00008 8 1437175 0 1307639 1959 12 1947 1959 0 inf 19
dirhashblk 2048 4768 0 379 2269 74 2195 2229 0 inf 0
kmem-00768 768 38965 0 26829 3343 915 2428 2428 0 inf 0
kmem-00128 128 515001 0 469640 14110 11490 2620 14110 0 inf 767
zio_buf_2560 2560 177067 0 173838 152692 149463 3229 46055 0 inf 0
vmmpepl 128 326633 0 277087 3821 558 3263 3263 0 inf 284
ractx 32 618097 0 208209 3782 482 3300 3383 0 inf 10
vmembt 64 644698 0 492932 3492 0 3492 3492 0 inf 0
zio_cache 984 65228 0 64252 7865 4122 3743 4476 0 inf 3499
bufpl 272 82024 0 25305 4636 665 3971 3971 0 inf 11
kmem-00032 32 1100883 0 807683 4978 811 4167 4978 0 inf 116
buf16k 16384 29586 0 14699 5639 1422 4217 4430 0 inf 0
zio_data_buf_13 131072 109221 0 104477 92654 87910 4744 17380 0 inf 0
kmem-01024 1024 80367 0 55194 9091 2051 7040 7424 0 inf 746
kva-16384 16384 431364 3 290889 9652 289 9363 9634 0 inf 0
pcglarge 1024 1280332 0 1236977 44969 34130 10839 11712 0 inf 0
kmem-00384 384 517294 0 408327 20823 9926 10897 10897 0 inf 0
phpool-64 56 2038826 4 1494992 12943 1023 11920 12917 0 inf 20
pvpage 4096 31460 1 21252 24435 11658 12777 14270 0 inf 2569
zio_buf_3584 3584 283333 0 270526 253663 240856 12807 71768 0 inf 0
kmem-00064 64 2452484 0 1843883 15528 1886 13642 15528 0 inf 2091
radixnode 128 857688 0 410392 17462 2971 14491 14491 0 inf 0
zio_buf_3072 3072 355898 0 340436 306458 290996 15462 96733 0 inf 0
buf2k 2048 59752 0 18803 27664 7188 20476 20477 0 inf 0
anonpl 32 10356573 0 7483518 36255 13452 22803 24315 0 inf 0
pcgnormal 256 3758866 0 3660775 72144 49266 22878 36859 0 inf 1271
kmem-00192 192 1066440 0 735229 25691 336 25355 25691 0 inf 1414
arc_buf_hdr_t_f 200 2852920 0 2738056 46724 21056 25668 38169 0 inf 4
mutex 64 2386139 0 717605 26491 4 26487 26487 0 inf 0
ffsdino2 256 690230 0 206450 33605 475 33130 33130 0 inf 1268
ffsino 272 690230 0 206450 35861 522 35339 35339 0 inf 1377
zio_buf_16384 16384 704287 32 667026 644883 607622 37261 149436 0 inf 0
sa_cache 104 3115636 0 2072844 42952 1936 41016 42952 0 inf 30
rwlock 64 3849631 0 561234 52220 23 52197 52197 0 inf 0
namecache 128 2437368 0 850070 53742 0 53742 53742 0 inf 0
kmem-02048 2048 805665 0 663183 253638 182078 71560 251451 0 inf 319
zfs_znode_cache 240 3088835 0 2046043 109029 12668 96361 102010 0 inf 23
dmu_buf_impl_t 208 6425103 0 5303519 124086 2184 121902 124086 0 inf 2697
vcachepl 576 2252152 0 725245 240158 1866 238292 238292 0 inf 1
zio_buf_512 512 7094849 0 6011939 343777 96177 247600 279141 0 inf 0
dnode_t 632 5534168 0 4451068 372058 39269 332789 364352 0 inf 10837
In contrast, on a 5G domU with no zfs, I see 800M of pools. The top
users by npages are:
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
vmmpepl 144 80217 0 53575 1028 13 1015 1015 0 inf 12
kmem-2048 2048 8529 0 6268 1729 598 1131 1194 0 inf 0
kmem-1024 1024 17696 0 12741 1450 182 1268 1300 0 inf 4
pcglarge 1024 62614 0 58086 8374 7097 1277 1277 0 inf 145
kva-4096 4096 149007 0 42472 1989 172 1817 1817 0 inf 0
anonpl 32 1574629 0 1152567 3711 240 3471 3611 0 inf 0
buf1k 1024 156309 0 29550 4702 729 3973 4096 1 1 0
mutex 64 784470 0 518929 6424 977 5447 5470 0 inf 1
pvpl 40 1271682 0 726803 5734 121 5613 5703 0 inf 0
buf8k 8192 40441 42 17360 7585 1814 5771 6869 1 1 0
bufpl 296 175399 0 24677 12733 1139 11594 12114 0 inf 0
ncache 192 522831 0 271515 12258 47 12211 12211 0 inf 0
ffsdino2 256 797229 0 542806 27135 8784 18351 20976 0 inf 0
ffsino 256 790479 0 536056 27110 8736 18374 20976 0 inf 0
vcachepl 336 786851 0 531358 37452 13970 23482 27972 0 inf 0
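For anyone wanting to produce a similar list: sorting vmstat -m output
numerically on the Npage column (column 8 in the output above) should
do it, modulo the header and totals lines sorting oddly:

vmstat -m | sort -nk 8 | tail -50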