I thought I would pass by some proposed changes to the statistics collection in sys/kern/vfs_cache.c to see if I am over-doing (or under-doing) it, or just doing it wrong. Statistics are annoying.

Since this code was made MP-capable, the basic method of stats collection has been to provide each processor with a separate counter structure which is incremented only by threads running on that processor. When cache_reclaim() runs (at least once per second) the per-processor counts are collected, added into the subsystem totals and zeroed. This is a nice method: the counters are generally well-cached on the cpu doing the increments, with no contention. It would be nice if networking stats could work like this.

The original code was, however, a bit sloppy about it. Some of the counters were incremented inside locked code sections which prevented cache_reclaim() from running concurrently, but others were incremented at points where no lock was held. In the latter case the increments were just done anyway, non-atomically, so counting races could occasionally cause a second's worth of increments to be counted twice, or increments to be dropped. Worse, it constrained the counters to be longs just to ensure that writes were atomic, and a long on a 32-bit machine isn't long enough for some of these (my build machine incremented one of the counters by more than 20 billion in less than 3 days of uptime).

The more recent modification to the code fixed the sloppiest part of this by adding locking around every counter increment. The problem with this is that since the primary purpose of those locks is operational, they can sometimes be held for a long time, blocking forward progress for the sake of a stats counter increment. And while it added a structure with 64-bit counters for the sysctl(3) interface, it left the internal structure into which the stats are tallied with the long counters.
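To make the discussion concrete, here is a minimal standalone sketch of the harvest-and-zero scheme described above. The names (NCPU, pcpu_stats, harvest_and_zero, and the counter fields) are invented for illustration; the real code in sys/kern/vfs_cache.c is organized differently and relies on locks to keep the harvester and the incrementers apart.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCPU 4    /* illustrative fixed cpu count */

/* Hypothetical, simplified per-cpu stats: each cpu increments only
 * its own slot, so the hot path needs no locking or atomics. */
struct cpu_stats {
	unsigned long ncs_goodhits;	/* hits */
	unsigned long ncs_miss;		/* misses */
};

static struct cpu_stats pcpu_stats[NCPU];	/* one slot per cpu */
static struct cpu_stats nchstats_total;		/* subsystem totals */

/* Hot path: bump a counter in the current cpu's slot, non-atomically. */
static void
stat_hit(int ci)
{
	pcpu_stats[ci].ncs_goodhits++;
}

/* Harvest, as the original scheme did: fold each cpu's counts into
 * the totals and zero them for the next interval. */
static void
harvest_and_zero(void)
{
	for (int i = 0; i < NCPU; i++) {
		nchstats_total.ncs_goodhits += pcpu_stats[i].ncs_goodhits;
		nchstats_total.ncs_miss += pcpu_stats[i].ncs_miss;
		memset(&pcpu_stats[i], 0, sizeof(pcpu_stats[i]));
	}
}
```

The zeroing step is what forces the harvester and the incrementers to be serialized somehow; that is the cost the proposal below tries to remove.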
I'd like to get the cost of stats collection back closer to that of the original sloppy code by removing those locks, but without being quite as sloppy as the original. It was suggested that I use atomic operations for the counters instead, but I'm not fond of that. Since we are already going to the expense of keeping per-cpu counts it should be possible to leverage that to avoid atomics, and I particularly don't want to burden cache_reclaim()'s stats harvesting with atomic operations since cache_reclaim() has enough to do already. Here's what I'd like to do instead:

- Make the subsystem total counts maintained by cache_reclaim() (nchstats) 64 bits long on all machines, the same as the sysctl interface. Make the per-cpu counts 32 bits on all machines. They only count a second's worth of activity on a single cpu, for which 32 bits is plenty, and this makes per-cpu counter writes atomic on all supported machines.

- Make cache_reclaim() refrain from zeroing the per-cpu counts in favor of computing the difference between the current count and the count it saw the last time it ran. This makes its stats gathering lockless while avoiding atomic operations, at the expense of the space to store the last counts.

- Have every increment go to the per-cpu stats of the cpu the incrementing thread is running on, and do non-atomic increments even when no lock is held. For the latter to fail requires that the thread doing the outside-a-lock increment be interrupted by another thread running on the same cpu which goes on to increment the same counter. I suspect (but can't quite prove) this can't happen, but if it does, the worst consequence seems to be the occasional loss of a single increment.

- Have cache_stat_sysctl() do the minimal locking needed to hold off a concurrent run of cache_reclaim() while it copies the values the latter updates.
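The lockless, zero-free harvesting in the proposal can be sketched as below. Again the names are invented for illustration (the real nchstats is a structure of counters, not a single value); the point is that unsigned 32-bit subtraction yields the correct delta even across a counter wrap, so the harvester only needs one plain read of each per-cpu counter.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4    /* illustrative fixed cpu count */

static uint32_t pcpu_count[NCPU];	/* incremented only by its own cpu */
static uint32_t last_count[NCPU];	/* what the harvester saw last time */
static uint64_t nchstats_total;		/* 64-bit subsystem total */

/* Harvest without zeroing: accumulate the difference from the last
 * snapshot.  Unsigned subtraction gives the right delta even if the
 * 32-bit per-cpu counter wrapped since the previous pass, provided
 * less than 2^32 increments happen between passes. */
static void
harvest(void)
{
	for (int i = 0; i < NCPU; i++) {
		uint32_t cur = pcpu_count[i];	/* single 32-bit read */
		nchstats_total += (uint32_t)(cur - last_count[i]);
		last_count[i] = cur;
	}
}
```

Since the harvester never writes the per-cpu counters, the incrementing side and the harvesting side no longer need to exclude each other at all.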
cache_stat_sysctl() can make do with a copy of cache_reclaim()'s once-per-second stats, or update that to current values by polling the per-cpu stats itself (the latter makes 'systat vmstat' output less blocky).

The attached patch does all this. It also changes the name of the struct (with no change to its members) passed via sysctl(3) from 'struct nchstats_sysctl' to 'struct nchstats', and on 32-bit machines changes the size of the 'struct nchstats nchstats' which vmstat can be told to read from a dead kernel's core. Would either of those changes be enough to require a version bump to the kernel?

Dennis Ferguson
Attachment:
cache_stats.patch