I thought I would pass by some proposed changes to the statistics collection in sys/kern/vfs_cache.c to see if I am over-doing (or under-doing) it, or just doing it wrong. Statistics are annoying.

Since this code was made MP-capable, the basic method of stats collection has been to provide each processor with a separate counter structure which is incremented only by threads running on that processor. When cache_reclaim() runs (at least once per second) the per-processor counts are collected, added into the subsystem totals and zeroed. This is a nice method: the counters are generally well-cached on the cpu doing the increments, with no contention. It would be nice if networking stats could work like this.

The original code was, however, a bit sloppy about it. Some of the counters were incremented inside locked code sections which prevented cache_reclaim() from running concurrently, but others were incremented at points where no lock was held. In the latter case the increments were just done anyway, non-atomically, so counting races could occasionally cause a second's worth of increments to be counted twice, or increments to be dropped. Worse, it constrained the counters to be longs just to ensure that writes were atomic, and a long on a 32-bit machine isn't long enough for some of these (my build machine incremented one of the counters by more than 20 billion in less than 3 days of uptime).

The more recent modification to the code fixed the sloppiest part of this by adding locking around every counter increment. The problem with this is that since the primary purpose of those locks is operational, they can sometimes be held for a long time, blocking forward progress for the sake of a stats counter increment. And while it added a structure with 64-bit counters for the sysctl(3) interface, it left the internal structure into which the stats are tallied with the long counters.
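To make the discussion concrete, here is a minimal standalone sketch of the harvest-and-zero scheme described above. The names (NCPU, pcpu_stats, harvest_and_zero, and the counter fields) are invented for illustration; the real code in sys/kern/vfs_cache.c is organized differently and relies on locks to keep the harvester and the incrementers apart.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU 4    /* illustrative fixed cpu count */

/* Hypothetical, simplified per-cpu stats: each cpu increments only
 * its own slot, so the hot path needs no locking or atomics. */
struct cpu_stats {
	unsigned long ncs_goodhits;	/* hits */
	unsigned long ncs_miss;		/* misses */
};

static struct cpu_stats pcpu_stats[NCPU];	/* one slot per cpu */
static struct cpu_stats nchstats_total;		/* subsystem totals */

/* Hot path: bump a counter in the current cpu's slot, non-atomically. */
static void
stat_hit(int ci)
{
	pcpu_stats[ci].ncs_goodhits++;
}

/* Harvest, as the original scheme did: fold each cpu's counts into
 * the totals and zero them for the next interval. */
static void
harvest_and_zero(void)
{
	for (int i = 0; i < NCPU; i++) {
		nchstats_total.ncs_goodhits += pcpu_stats[i].ncs_goodhits;
		nchstats_total.ncs_miss += pcpu_stats[i].ncs_miss;
		memset(&pcpu_stats[i], 0, sizeof(pcpu_stats[i]));
	}
}
```

The zeroing step is what forces the harvester and the incrementers to be serialized somehow; that is the cost the proposal below tries to remove.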
I'd like to get the cost of stats collection back closer to that of the original sloppy code by removing those locks, but without being quite as sloppy as the original. It was suggested that I use atomic operations for the counters instead, but I'm not fond of that. Since we are already going to the expense of keeping per-cpu counts it should be possible to leverage that to avoid atomics, and I particularly don't want to burden cache_reclaim()'s stats harvesting with atomic operations since cache_reclaim() has enough to do already. Here's what I'd like to do instead:

- Make the subsystem total counts maintained by cache_reclaim() (nchstats) 64 bits long on all machines, the same as the sysctl interface. Make the per-cpu counts 32 bits on all machines. They only count a second's worth of activity on a single cpu, for which 32 bits is plenty, and this makes per-cpu counter writes atomic on all supported machines.

- Make cache_reclaim() refrain from zeroing the per-cpu counts in favor of computing the difference between the current count and the count it saw the last time it ran. This makes its stats gathering lockless while avoiding atomic operations, at the expense of the space to store the last counts.

- Have every increment go to the per-cpu stats of the cpu the incrementing thread is running on, and do non-atomic increments even when no lock is held. For the latter to fail requires that the thread doing the outside-a-lock increment be interrupted by another thread running on the same cpu which goes on to increment the same counter. I suspect (but can't quite prove) this can't happen, but if it does, the worst consequence seems to be the occasional loss of a single increment.

- Have cache_stat_sysctl() do the minimal locking needed to hold off a concurrent run of cache_reclaim() while it copies the values the latter updates.
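The lockless, zero-free harvesting in the proposal can be sketched as below. Again the names are invented for illustration (the real nchstats is a structure of counters, not a single value); the point is that unsigned 32-bit subtraction yields the correct delta even across a counter wrap, so the harvester only needs one plain read of each per-cpu counter.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4    /* illustrative fixed cpu count */

static uint32_t pcpu_count[NCPU];	/* incremented only by its own cpu */
static uint32_t last_count[NCPU];	/* what the harvester saw last time */
static uint64_t nchstats_total;		/* 64-bit subsystem total */

/* Harvest without zeroing: accumulate the difference from the last
 * snapshot.  Unsigned subtraction gives the right delta even if the
 * 32-bit per-cpu counter wrapped since the previous pass, provided
 * less than 2^32 increments happen between passes. */
static void
harvest(void)
{
	for (int i = 0; i < NCPU; i++) {
		uint32_t cur = pcpu_count[i];	/* single 32-bit read */
		nchstats_total += (uint32_t)(cur - last_count[i]);
		last_count[i] = cur;
	}
}
```

Since the harvester never writes the per-cpu counters, the incrementing side and the harvesting side no longer need to exclude each other at all.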
cache_stat_sysctl() can make do with a copy of cache_reclaim()'s once-per-second stats, or update that to current values by polling the per-cpu stats itself (the latter makes 'systat vmstat' output less blocky).

The attached patch does all this. It also changes the name of the struct (with no change to its members) passed via sysctl(3) from 'struct nchstats_sysctl' to 'struct nchstats', and on 32-bit machines changes the size of the 'struct nchstats nchstats' which vmstat can be told to read from a dead kernel's core. Would either of those changes be enough to require a version bump to the kernel?

Dennis Ferguson
Attachment:
cache_stats.patch