Subject: sparc64 pmap optimizations
To: port-sparc64@netbsd.org, tech-kern@netbsd.org
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 08/25/2002 21:46:18
hi folks,

as a warm-up to working on sparc64 MP support, I spent some time
improving the pmap.  the two cases I was looking at in particular
were fork+exit and fork+exec+exit.  the results are:

fork+exit
orig	0.025u 0.979s 0:09.22 10.7%     0+0k 0+1io 0pf+0w
chuq	0.014u 0.674s 0:07.51 9.0%      0+0k 0+0io 0pf+0w
improvement	18.5%

fork+exec+exit
orig	0.076u 2.232s 0:07.57 30.3%     0+0k 0+3io 0pf+0w
chuq	0.137u 1.609s 0:05.92 29.2%     0+0k 0+1io 0pf+0w
improvement	21.8%


the diffs are available for review at

	ftp://ftp.netbsd.org/pub/NetBSD/misc/chs/sparc64pmap/

(as per my usual patch distribution method, apply the latest patch
to -current as of that date).


the MI changes are:

 - there's a new optional pmap interface:

	void pmap_predestroy(struct pmap *)

   this is a hint to the pmap layer that this pmap will be destroyed soon,
   and that the only operations that will be performed on it before then
   are pmap_remove()s.  the pmap layer indicates the availability of this
   interface to UVM by defining __HAVE_PMAP_PREDESTROY in <machine/pmap.h>.
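
   to make that concrete, the teardown path in UVM ends up shaped
   roughly like this (a sketch, not the literal diff):

	#ifdef __HAVE_PMAP_PREDESTROY
		/*
		 * only pmap_remove()s will follow, so the pmap can
		 * defer TLB and cache flushing until the pmap dies.
		 */
		pmap_predestroy(pmap);
	#endif
		/* ... the pmap_remove() calls for the old mappings ... */
		pmap_destroy(pmap);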

 - there's an optional semantic change to pmap_activate():

   I've got uvmspace_exec() temporarily switching to the kernel's pmap
   (by calling pmap_activate(&proc0)) in order to speed up tearing down
   the process's old pmap.  however, many pmaps cannot handle having
   pmap_activate() called with the kernel pmap.  a pmap can indicate
   that it's willing to be called in this context by defining
   __HAVE_PMAP_ACTIVATE_KERNEL in <machine/pmap.h>.
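
   so the shape of uvmspace_exec() becomes approximately this
   (heavily abbreviated, with the vmspace bookkeeping elided):

	#ifdef __HAVE_PMAP_ACTIVATE_KERNEL
		pmap_activate(&proc0);	/* run on the kernel's pmap */
	#endif
		/* tear down the old pmap while nothing is using it */
	#ifdef __HAVE_PMAP_ACTIVATE_KERNEL
		pmap_activate(p);	/* switch to the new pmap */
	#endif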

 - we cache a few free kernel stacks (currently 16) to avoid cache flushing
   and mucking with kernel_map.  there's currently no feedback mechanism
   for forcing these to be freed.  I'm imagining some kind of registration
   mechanism for the pagedaemon to call back to the various subsystems that
   might be hanging on to significant memory that could be freed easily.
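
   the stack cache itself is nothing fancy; a minimal sketch of the
   idea (names are made up, and locking and the wiring of backing
   pages are elided):

	#define	USTACK_CACHE_MAX	16

	static vaddr_t ustack_cache[USTACK_CACHE_MAX];
	static int ustack_cache_cnt;

	vaddr_t
	ustack_alloc(void)
	{

		if (ustack_cache_cnt > 0)
			return (ustack_cache[--ustack_cache_cnt]);
		return (uvm_km_alloc(kernel_map, USPACE));	/* slow path */
	}

	void
	ustack_free(vaddr_t va)
	{

		if (ustack_cache_cnt < USTACK_CACHE_MAX) {
			/* keep it mapped: no cache flush, no kernel_map op */
			ustack_cache[ustack_cache_cnt++] = va;
			return;
		}
		uvm_km_free(kernel_map, va, USPACE);
	}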


here's the overall list of changes:

 - use struct vm_page_md for attaching pv entries to struct vm_page
 - change pseg_set()'s return value to indicate whether the spare page
   was used as an L2 or L3 PTP.
 - use a pool for pv entries instead of malloc() (call sites sketched
   after this list).
 - put PTPs on a list attached to the pmap so we can free them
   more efficiently (by just walking the list) in pmap_destroy().
   see the sketch after this list.
 - use the new pmap_predestroy() interface to avoid flushing the cache and TLB
   for each pmap_remove() that's done as we are tearing down an address space.
 - in pmap_enter(), handle replacing an existing mapping more efficiently
   than just calling pmap_remove() on it.  also, skip flushing the
   TSB and TLB if there was no previous mapping, since there can't be
   anything we need to flush.
 - allocate hardware contexts the way the MIPS pmap does:
   allocate them all sequentially without reuse, then once we run out
   just invalidate all user TLB entries and flush the entire CPU cache
   (sketched after this list).
 - fix pmap_extract() for the case where the va is not page-aligned and
   nothing is mapped there.
 - fix the calculation of the TSB size.  it was comparing physmem
   (which is in units of pages) against constants that only make sense
   as byte counts (see the last sketch after this list).
 - remove code to handle impossible cases in various functions.
 - tweak asm code to pipeline a little better.
 - remove a few unnecessary spls.
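
a few of the items above are easier to see in code.  the pv entry
change just swaps the allocator, so the call sites become approximately
this (the pool name is illustrative):

	struct pv_entry *pv;

	pv = pool_get(&pmap_pv_pool, PR_NOWAIT);
	if (pv == NULL)
		return (ENOMEM);	/* or whatever the caller can do */
	/* ... */
	pool_put(&pmap_pv_pool, pv);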
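
the PTP list is a queue(3) list hung off the pmap, something like this
(the member names are illustrative, not the actual diff):

	struct pmap {
		/* ... */
		TAILQ_HEAD(, vm_page) pm_ptps;	/* all PTPs of this pmap */
	};

	/* pmap_destroy() can then free every PTP with a single walk: */
	struct vm_page *ptp;

	while ((ptp = TAILQ_FIRST(&pm->pm_ptps)) != NULL) {
		TAILQ_REMOVE(&pm->pm_ptps, ptp, pageq);
		uvm_pagefree(ptp);
	}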
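
the context allocator amounts to this (a sketch with placeholder
names; the generation bookkeeping that makes other pmaps reallocate
their contexts after a wrap is elided):

	static u_int pmap_next_ctx = 1;		/* ctx 0 is the kernel's */

	u_int
	ctx_alloc(struct pmap *pm)
	{

		if (pmap_next_ctx == numctx) {
			/* out of contexts: invalidate everything at once */
			tlb_flush_all_user();	/* placeholder */
			cache_flush_all();	/* placeholder */
			pmap_next_ctx = 1;
		}
		return (pm->pm_ctx = pmap_next_ctx++);
	}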
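
and the TSB sizing bug is a plain units mismatch; the fix amounts to
converting pages to bytes before comparing (the 64MB threshold and the
variable names here are only illustrative):

	/* broken: physmem counts pages, the constant counts bytes */
	tsbsize = (physmem >= 64 * 1024 * 1024) ? big : small;

	/* fixed: convert pages to bytes first */
	tsbsize = (ptoa((psize_t)physmem) >= 64 * 1024 * 1024) ? big : small;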



could people try out these changes and let me know how it goes?
I did my testing on an ultra2.

also, feedback on the changes in general would be great.

-Chuck