
Re: continued zfs-related lockups



Simon Burge <simonb%NetBSD.org@localhost> writes:

> I can reproduce this behaviour with:
>
> 	for d in $(seq 0 99); do
> 	  echo dir $d; mkdir dir$d
> 	  seq 0 99 | xargs -n 1 -I % sh -c "echo $d % > dir$d/%"
> 	done
> 	rm -rf dir? dir?? &
> 	vmstat
> 	   [ check how much KB is free ]
> 	dd if=/dev/zero of=/dev/null bs=820000k count=50
> 	   [ where 820000 kB was just under the amount of memory free ]

Do you mean you can reproduce slowness and/or the messages, or an actual
hard lockup from which the system does not return?

> After creating the files, this also works to trigger the messages:
>
> 	vmstat
> 	   [ check how many KB is free ]
> 	dd if=/dev/zero of=/dev/null bs=820000k count=50
> 	   [ where 820000 kB was just under the amount of memory free ]
> 	find dir* -type f | xargs cat > /dev/null
>
> The "dd if=/dev/zero of=/dev/null bs=XXX" thing is a good way to
> allocate a chunk of user memory, probably quite similar to how your
> "touchmem" program does in practice.

agreed.
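
FWIW, a minimal sketch of what a "touchmem"-style program might look
like (hypothetical, not the actual source; it just allocates memory and
touches every page so the pages are really faulted in):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	/*
	 * Hypothetical "touchmem" sketch: allocate the requested number
	 * of megabytes, touch every byte so the pages are actually
	 * faulted in, and hold them until killed.  This creates the
	 * same kind of memory pressure as the dd trick above.
	 */
	int
	main(int argc, char **argv)
	{
		size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 512;
		char *p = malloc(mb << 20);

		if (p == NULL) {
			perror("malloc");
			return 1;
		}
		memset(p, 1, mb << 20);
		printf("touched %zu MB, sleeping\n", mb);
		pause();
		return 0;
	}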

>> [ 2247.3254720] arc_reclaim_thread: negative free_memory -15888384
>
> Doesn't this mean "Can you try to free 15888384 bytes if possible"?

Yes.  This is a message I added, so presumably you applied the patch I
think I sent earlier.

It means that arc_available_memory() returned a negative value.
Basically, the ARC can grow beyond its desired size; if the system is
not short of memory, that's OK, but if it is, the amount by which the
ARC exceeds its target counts as negative free memory, so the ARC
should shrink.
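
In rough pseudocode (a paraphrase of the idea only, not the actual
arc_available_memory() source; the helper and names are approximate):

	/*
	 * Paraphrase, not the real code: if the ARC has grown past its
	 * target (arc_c), report the overshoot as negative free memory
	 * so the reclaim thread tries to shrink the ARC by that much.
	 */
	static int64_t
	arc_available_memory_sketch(void)
	{
		/* free_memory_in_bytes() stands in for the UVM query */
		int64_t n = free_memory_in_bytes();

		if (arc_size > arc_c)
			n -= (int64_t)(arc_size - arc_c);
		return n;	/* < 0 => bytes the ARC should free */
	}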

This situation is not intrinsically problematic.
A page I find helpful:
  https://www.brendangregg.com/blog/2012-01-09/activity-of-the-zfs-arc.html

> Running a few "sysctl kstat.zfs.misc.arcstats.size" shows:
>
>  - before the "dd" and "find ... cat":
> 	kstat.zfs.misc.arcstats.size = 31990280
>  - during the "dd":
> 	kstat.zfs.misc.arcstats.size = 31991240
> 	kstat.zfs.misc.arcstats.size = 31996984
>  - after "dd" and "find ... cat" finishes
> 	kstat.zfs.misc.arcstats.size = 31995776
>
> I think this is ZFS noticing free memory is low and trying to do
> something about it, but perhaps not very successfully?

Yes, it is noticing and not actually freeing.
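
If it's useful, here's a quick userland sketch for watching that (using
sysctlbyname(3) against the kstat node from your output):

	#include <sys/param.h>
	#include <sys/sysctl.h>
	#include <inttypes.h>
	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Poll the ARC size once a second, so growth (or the lack of
	 * shrinking) under memory pressure is easy to see next to
	 * vmstat output.
	 */
	int
	main(void)
	{
		uint64_t sz;
		size_t len;

		for (;;) {
			len = sizeof(sz);
			if (sysctlbyname("kstat.zfs.misc.arcstats.size",
			    &sz, &len, NULL, 0) == -1) {
				perror("sysctlbyname");
				return 1;
			}
			printf("arc size: %" PRIu64 "\n", sz);
			sleep(1);
		}
	}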

>> I wonder if others who have problems also see this kernel message.
>
> This is on an amd64 qemu VM with 1GB of RAM and a 384MB disk (all ZFS).

A punishing test case, but really it ought to work.


I have recently come to understand something about the "size of the
ARC".  In addition to the data buffers on the MRU/MFU lists, the headers
for those buffers, and the headers-only entries for the ghost lists,
there is other storage in use that is recorded as ARC usage.  However,
there is no real way to free much of that other storage.

Worse, after the system has been up and under continued stress for a
while, the ARC size grows even bigger (in a way that can't be freed).
This seems to run the system truly out of memory, and my theory is that
some bug is then hit that causes a lockup.  While that bug should be
fixed, the major pain point is ARC growth beyond its intended size.

A particularly important kind of other storage is "dnode_t" (allocated
from a pool), associated with vnodes.  So setting kern.maxvnodes lower
helps a lot in avoiding trouble.  This may account for 90% of the
problem, maybe even more.
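(Something like "sysctl -w kern.maxvnodes=N", with N well below the
default, is what I mean; the right value presumably depends on how much
RAM the machine has.)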

I therefore wonder:

  Are dnodes freed (and recorded as freed) when vnodes are recycled?

  Do vnodes really need to hang onto dnodes?

  There are "znodes" it seems.  How do these relate?

  The ARC code talks about metadata a lot.  What precisely is in
  entries marked metadata?

  In the accounting, metadata is everything that is not a data buf.
  Thus all this "other" storage is counted as metadata, which I think
  leads to aggressive eviction of in-cache objects marked metadata, and
  that doesn't seem to match the intent.

  FreeBSD has a call into dnlc (which I think is more like the namei
  cache) under memory pressure.  I would think it helpful if zfs vnodes
  were freed when the ARC is too big.

  It strikes me that the dnodes (showing up as
    kstat.zfs.misc.arcstats.other_size = 212192
  in this case, when ~nothing has happened) being counted as part of
  the ARC but not being evictable is a puzzling approach that is not
  clearly working.  Rather, if we're going to have lots of dnodes, then
  something that tries to free them when they are over target (and
  there is memory pressure) might help.


I think the biggest question is how dnode/znode/vnode relate, and
whether a vnode really needs to hang on to the on-disk representation
of something.

