NetBSD-Bugs archive
Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
The following reply was made to PR kern/57558; it has been noted by GNATS.
From: Frank Kardel <kardel%netbsd.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc:
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200
Sure
Setup:
- all userlands NetBSD-10.0_BETA
- NetBSD 10.0_BETA (2023-07-26) (-current should also work) XEN3_DOM0
(pagedaemon patched, see the pd.diff attachment)
- xen-4.15.1
- NetBSD 10.0_BETA GENERIC as DOMU
- on the DOM0, a ZFS file system provides the backing file for the FFS
file system in the DOMU
- the DOMU has a postgresql 14.8 installation
- the test case is loading a significant database (~200 GB) into the
postgres DB.
This seems complicated to set up (but I am preparing this kind of VM for
our purposes).
Going by the errors detected, it should also be possible (not tested) to:
- create a ZFS file system on a plain GENERIC system
- create a file system file in ZFS
- vnconfig vndX <path to the file system file>
- disklabel vndX
- newfs vndXa
- mount /dev/vndXa /mnt
- do lots of file system traffic on the mounted fs: writing, deleting, rewriting
Part 1 - current situation:
Use
sdt:::arc-available_memory
{
printf("mem = %d, reason = %d", arg0, arg1);
}
to track what ZFS thinks it has as memory: positive values mean there is
enough memory, negative values ask the ZFS ARC to free that much memory.
Use vmstat -m to track pool usage: you should see that ZFS takes
more and more memory until 90% of kmem is used by the pools.
At that point you should see a ~100% busy pgdaemon in top, and
the pagedaemon patch should report high counts for loops, cnt_starved and
cnt_avail, as uvm_availmem(false) still reports many free pages.
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, cnt_avail=16023729, fpages=336349
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, cnt_avail=16018349, fpages=336542
/var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, cnt_avail=16025793, fpages=336516
...
That documents the tight loop making no progress. The pgdaemon will not
recover; see my analysis.
Observe that arc_reclaim is not freeing anything (and collects no CPU
time, see top) because arc_available_memory claims that there is enough
free memory (it looks at uvm_availmem(false)).
The dtrace probe documents that.
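To illustrate the problem, here is a minimal sketch (hypothetical, not the
actual arc.c code; the function name is made up) of the kind of check the
unpatched memory accounting boils down to, using only the free-page count:

#include <sys/param.h>
#include <uvm/uvm_extern.h>

/* hypothetical sketch, not the code from arc.c */
static int64_t
zfs_free_page_headroom(void)
{
        /* free pages above the pagedaemon's free target, in bytes */
        int64_t n = (int64_t)uvm_availmem(false) - (int64_t)uvmexp.freetarg;

        /*
         * This stays positive as long as free pages exist, even while
         * 90% of kmem is tied up in ZFS pools, so the ARC is never
         * asked to shrink.
         */
        return n * PAGE_SIZE;
}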
Part 2 - get the arc_reclaim thread to actually be triggered before kmem
is starved.
Install Patch 1 from the bug report. This lets ZFS look at the
kmem_arena space situation, which is also what
uvm_km.c:uvm_km_va_starved_p(void) looks at.
Now ZFS has a chance to start reclaiming memory.
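As a hedged sketch of the Patch 1 idea (the authoritative change is the
diff attached to the PR; the function name is made up, same #includes as
the sketch above):

/* hypothetical sketch of the Patch 1 idea, not the actual diff */
static int64_t
zfs_memory_headroom(void)
{
        int64_t n = (int64_t)uvm_availmem(false) - (int64_t)uvmexp.freetarg;

        /*
         * Also consult the kmem arena, the resource that actually runs
         * out here: if it is (nearly) starved, report negative headroom
         * so arc_reclaim wakes up and shrinks the ARC in time.
         */
        if (uvm_km_va_starved_p())
                n = -(int64_t)uvmexp.freetarg;

        return n * PAGE_SIZE;
}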
Run the load test again.
The dtrace probe should now show decreasing memory until it goes
negative, and it will stay negative by a certain amount.
vmstat -m should show that ZFS now only hogs ~75% of kmem.
Also, there should be a significant count in the Idle page column, as the
arc_reclaim thread did give up memory.
As the idle pages are not yet reclaimed from the pools, ZFS is constantly
asked to free memory (dtrace probe), and vmstat -m will
show the non-zero Idle page counts. So ZFS now has ~75% of kmem
allocated but utilizes only a small part of it: the cache
is allocated but no longer used.
We need to get the Idle pages actually reclaimed from the pools. This is
done by Patch 2 from the bug report.
There is no way to pass this task to the pgdaemon, since it only looks at
uvm_availmem(false), which does not consider kmem unless it is starved.
Also, the pool drain thread drains only one pool per invocation, and it
is not even triggered here.
So Patch 2 directly reclaims from the pool_cache_invalidate()ed pool, as
sketched below.
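A hedged sketch of that idea (the authoritative change is Patch 2 in the
PR; pool_cache_reclaim() is my assumption standing in for "release the
idle pages", the real patch may reclaim through a different path):

#include <sys/pool.h>

/* hypothetical sketch of the Patch 2 idea, not the actual diff */
static void
zfs_cache_reap_now(pool_cache_t pc)
{
        /* drop the cached constructed objects back into the pool */
        pool_cache_invalidate(pc);

        /*
         * Release the now-idle backing pages right away instead of
         * waiting for the pagedaemon's drain thread (one pool per
         * invocation, and not even woken in this scenario).
         * pool_cache_reclaim() is assumed here.
         */
        pool_cache_reclaim(pc);
}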
With this strategy ZFS keeps the kmem usage around 75%, as Idle pages
are now reclaimed and ZFS only gets negative arc_available_memory
values when called for.
vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim
will run at a suitable rate when needed. ZFS pools should not show too many
idle pages (idle pages are removed after some cool-down time to reduce
xcall activity, if I read the code right).
dtrace should show both positive and negative arc_available_memory figures.
I did not keep the vmstat, dtrace and top outputs. But from a DOMU busy
loading databases (> 350 GB)
I see a vmstat -m of:
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
...
zfs_znode_cache 248 215697 0 0 13482 0 13482 13482 0 inf 0
zil_lwb_cache 208 84 0 0 5 0 5 5 0 inf 0
zio_buf_1024 1536 11248 0 7612 3278 1460 1818 1818 0 inf 0
zio_buf_10240 10240 1130 0 723 973 566 407 407 0 inf 0
zio_buf_114688 114688 351 0 200 339 188 151 151 0 inf 0
zio_buf_12288 12288 1006 0 714 721 429 292 305 0 inf 0
zio_buf_131072 131072 3150 89 2176 1841 867 974 974 0 inf 0
zio_buf_14336 14336 473 0 308 432 267 165 166 0 inf 0
zio_buf_1536 2048 2060 0 1065 549 51 498 498 0 inf 0
zio_buf_16384 16384 9672 0 481 9318 127 9191 9191 0 inf 0
zio_buf_2048 2048 2001 0 826 682 94 588 588 0 inf 0
zio_buf_20480 20480 461 0 301 428 268 160 160 0 inf 0
zio_buf_24576 24576 448 0 293 404 249 155 155 0 inf 0
zio_buf_2560 2560 2319 1 490 1948 119 1829 1829 0 inf 0
zio_buf_28672 28672 369 0 221 345 197 148 152 0 inf 0
zio_buf_3072 3072 4163 2 422 3861 120 3741 3741 0 inf 0
...
zio_buf_7168 7168 506 0 292 465 251 214 214 0 inf 0
zio_buf_8192 8192 724 0 329 635 240 395 395 0 inf 0
zio_buf_81920 81920 379 0 229 371 221 150 161 0 inf 0
zio_buf_98304 98304 580 0 421 442 283 159 163 0 inf 0
zio_cache 992 4707 0 0 1177 0 1177 1177 0 inf 0
zio_data_buf_10 1536 39 0 33 20 17 3 12 0 inf 0
zio_data_buf_10 10240 2 0 2 2 2 0 2 0 inf 0
zio_data_buf_13 131072 488674 0 323782 274996 110104 164892 191800 0 inf 0
zio_data_buf_15 2048 25 0 19 13 10 3 7 0 inf 0
zio_data_buf_20 2048 17 0 13 9 7 2 4 0 inf 0
zio_data_buf_20 20480 1 0 1 1 1 0 1 0 inf 0
zio_data_buf_25 2560 7 0 6 7 6 1 5 0 inf 0
...
Totals 222323337 98 210229180 1033080 125800 907280
In use 24951773K, total allocated 25255540K; utilization 98.8%
In the unpatched case all 32 GB were allocated.
The arc_reclaim_thread clocked in at 20 CPU seconds; that is OK.
Current dtrace output is:
dtrace: script 'zfsmem.d' matched 1 probe
CPU ID FUNCTION:NAME
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
7 274 none:arc-available_memory mem = 384434176, reason = 2
1 274 none:arc-available_memory mem = 384434176, reason = 2
The pagedaemon was never woken up and has accumulated 0 CPU seconds in 2 days.
This all looks very much as desired.
Hope this helps.
Best regards,
Frank
Attachment: pd.diff
--- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c 2023-07-29 17:52:46.392362932 +0200
+++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133 2023-07-29 14:18:05.000000000 +0200
@@ -270,11 +270,15 @@
/*
* main loop
*/
-
+/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
+/*XXXkd*/ time_t ts, last_ts = time_second;
for (;;) {
bool needsscan, needsfree, kmem_va_starved;
+/*XXXkd*/ cnt_loops++;
+
kmem_va_starved = uvm_km_va_starved_p();
+/*XXXkd*/ if (kmem_va_starved) cnt_starved++;
mutex_spin_enter(&uvmpd_lock);
if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
@@ -311,6 +315,8 @@
needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
needsscan = needsfree || uvmpdpol_needsscan_p();
+/*XXXkd*/ if (needsfree) cnt_needsfree++;
+/*XXXkd*/ if (needsscan) cnt_needsscan++;
/*
* scan if needed
*/
@@ -328,8 +334,18 @@
wakeup(&uvmexp.free);
uvm_pagedaemon_waiters = 0;
mutex_spin_exit(&uvmpd_lock);
+/*XXXkd*/ cnt_avail++;
}
+/*XXXkd*/ if (needsfree || kmem_va_starved) cnt_drain++;
+/*XXXkd*/ ts = time_second;
+/*XXXkd*/ if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
+/*XXXkd*/ printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
+/*XXXkd*/ cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
+/*XXXkd*/ cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
+/*XXXkd*/ last_ts = ts;
+/*XXXkd*/ }
+
/*
* scan done. if we don't need free memory, we're done.
*/