Subject: Re: Performance dive
To: Toru Nishimura <nisimura@itc.aist-nara.ac.jp>
From: Chuck Silvers <chuq@chuq.com>
List: port-pmax
Date: 03/18/2001 15:37:26
hi Tohru,
I can confirm there's some degradation from 1.5, but it's not as bad
for me as you're reporting.
running bonnie on my 5000/200 with a 1995-vintage 8GB micropolis disk:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
today's -1-5:
256 947 99.4 3084 48.3 1024 28.7 888 99.7 2650 35.2 51.1 7.1
452.259u 277.027s 18:20.86 66.2% 0+0k 16404+86655io 7pf+0w
today's -current:
256 793 98.5 2390 96.5 976 45.1 805 98.0 2399 65.0 55.4 14.5
459.653u 499.222s 20:26.84 78.1% 0+0k 16+218io 61pf+0w
from this I would say we've become cpu-bound for large writes,
and we're using a lot more cpu in general.
I did some profiling runs on large sequential writes, and here are
the functions using more than 2 seconds of cpu in my test:
% cumulative self self total
time seconds seconds calls us/call us/call name
-1-5:
50.18 88.46 88.46 memmove
15.34 115.49 27.03 htonl
8.21 129.97 14.48 126 114924.36 115591.09 mips_idle
4.64 138.14 8.17 _mcount
2.18 141.98 3.84 377824 10.16 10.16 __qdivrem
1.71 145.00 3.02 128066 23.55 53.35 ffs_blkpref
1.54 147.71 2.71 1910709 1.42 1.42 _splset
1.28 149.96 2.25 41210 54.60 174.31 ncr53c9x_intr
-current:
26.15 89.73 89.73 memcpy
21.82 164.61 74.88 _splset_noprof
5.60 183.81 19.20 __mcount
5.53 202.79 18.98 _mcount
5.33 221.08 18.30 481581 37.99 37.99 memset
4.73 237.32 16.23 255215 63.61 63.61 mips1_FlushDCache
2.31 245.26 7.94 208315 38.12 212.97 genfs_getpages
2.20 252.80 7.55 384757 19.61 62.52 pmap_enter
1.75 258.80 5.99 _splraise_noprof
1.38 263.53 4.73 128066 36.97 61.54 ffs_blkpref
1.28 267.91 4.38 804991 5.44 5.44 lockmgr
0.82 270.72 2.81 128602 21.87 266.03 uvm_fault
0.69 273.08 2.36 64035 36.85 308.30 ubc_alloc
0.68 275.41 2.33 56782 41.00 158.77 ncr53c9x_intr
0.68 277.73 2.32 593161 3.92 35.59 pmap_remove
0.67 280.03 2.29 192043 11.94 235.17 ubc_init
0.61 282.11 2.08 72112 28.87 56.06 uvm_map
0.59 284.13 2.02 1593440 1.27 1.54 siiintr
it appears that somebody has interrupts blocked during a lot of the
profiling samples, but we can't tell who. undoing the most recent change
to arch/mips/include/profile.h gives more useful information:
27.51 86.72 86.72 memcpy
20.02 149.82 63.10 __mcount
5.98 168.66 18.84 _mcount
5.67 186.54 17.88 481456 37.14 37.14 memset
4.93 202.09 15.55 254910 61.00 61.00 mips1_FlushDCache
4.92 217.62 15.52 4506101 3.44 3.44 _splset
2.46 225.37 7.75 208336 37.22 228.37 genfs_getpages
2.35 232.78 7.41 384781 19.25 67.52 pmap_enter
1.49 237.49 4.71 128066 36.79 67.37 ffs_blkpref
1.40 241.89 4.41 805046 5.47 5.47 lockmgr
0.92 244.80 2.91 128613 22.63 282.16 uvm_fault
0.85 247.48 2.67 4105781 0.65 0.65 _splraise
0.77 249.89 2.42 56281 42.96 168.49 ncr53c9x_intr
0.73 252.19 2.29 72122 31.79 59.70 uvm_map
0.72 254.45 2.27 192041 11.80 249.56 ubc_init
0.71 256.70 2.25 593612 3.78 37.01 pmap_remove
0.68 258.84 2.14 64035 33.37 324.01 ubc_alloc
and now we can see that our mystery interrupt-blocker was mcount().
next is memset(), which is mostly being called from genfs_getpages()
to zero pages we're allocating for the file. this is pretty silly,
but I have some ideas on how to avoid it. I already knew about this
and it's on my list of things to fix.
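one way to see why that zeroing is avoidable, as a sketch: a page we're
about to fill from a write never needs the parts the write will cover
to be zeroed first. the helper name and PAGE_SIZE here are illustrative
stand-ins, not the real genfs code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096		/* illustrative; the real value is MD */

/*
 * sketch: a freshly-allocated page only needs zeroing for the portion
 * that the pending write will not overwrite.  for the large sequential
 * writes in this test, each write covers the whole page, so the
 * memset() could be skipped entirely.
 */
static bool
need_zero(size_t write_off, size_t write_len)
{
	/* zero only if the write leaves part of the page untouched */
	return !(write_off == 0 && write_len >= PAGE_SIZE);
}
```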
next is mips1_FlushDCache(), mostly called from pmap_remove_pv() ->
pmap_remove() -> pmap_enter(). pmap_enter() is called about 2/3
of the time from uvm_pagermapin() and the other 1/3 of the time
from ubc_fault(). now the interesting thing here is that pmap_enter()
called from uvm_pagermapin() or ubc_fault() should only ever be overwriting
an invalid mapping, never changing one valid mapping to a different one,
so it shouldn't be necessary to flush anything. the calls to pmap_remove()
in pmap_enter() are:
	if (pmap == pmap_kernel()) {
		...
		if (mips_tlbpfn_to_paddr(pte->pt_entry) != pa) {
			pmap_remove(pmap, va, va + NBPG);
			...
		}
		...
		return 0;
	}

	if (mips_tlbpfn_to_paddr(pte->pt_entry) != pa) {
		pmap_remove(pmap, va, va + NBPG);
		...
	}
if the previous mapping was invalid, then we should skip the pmap_remove()
just like we do if the new mapping is the same as the old one.
could you try making that change and see what difference it makes
in cpu consumption?
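as a sketch, the condition we want is something like the following.
the types, the PTE_V bit, and the mips_pte_v() check are simplified
stand-ins for the real mips pmap definitions, not the actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* simplified stand-ins for the real mips pmap types */
typedef uint32_t pt_entry_t;
#define PTE_V		0x1			/* valid bit */
#define mips_pte_v(pte)	((pte) & PTE_V)

/*
 * decide whether pmap_enter() needs to tear down the old mapping:
 * only when the old PTE is valid AND maps a different physical
 * address.  an invalid previous mapping means there is nothing to
 * remove and nothing to flush.
 */
static bool
need_remove(pt_entry_t old_pte, uintptr_t old_pa, uintptr_t new_pa)
{
	if (!mips_pte_v(old_pte))	/* previous mapping invalid: skip */
		return false;
	return old_pa != new_pa;	/* same pa: also skip, as today */
}
```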
next is _splset(), which is called from
0.36 0.00 105696/4506101 uvm_pagefree [55]
0.44 0.00 128068/4506101 bdwrite [56]
0.44 0.00 128174/4506101 uvm_pagealloc_strat [54]
0.44 0.00 128181/4506101 genfs_getpages <cycle 4> [12]
0.44 0.00 128294/4506101 malloc [88]
0.44 0.00 128669/4506101 trap [13]
0.44 0.00 128736/4506101 free [106]
0.57 0.00 165242/4506101 softclock [72]
0.71 0.00 205957/4506101 getblk [42]
0.71 0.00 205970/4506101 bremfree [73]
0.71 0.00 206054/4506101 brelse [69]
0.81 0.00 234005/4506101 pmap_page_protect [49]
0.94 0.00 271678/4506101 dec_3max_init [32]
1.18 0.00 342811/4506101 uvm_unlock_fpageq [76]
1.32 0.00 383520/4506101 pmap_remove_pv [22]
1.33 0.00 384781/4506101 pmap_enter_pv [46]
2.18 0.00 632441/4506101 uvm_pagelookup [44]
[27] 6.7 15.52 0.00 4506101 _splset [27]
uvm_pagelookup() uses splvm() to protect uvm.hashlock, but I don't think
this is necessary since we shouldn't be changing the identity of pages
in interrupt handlers. we can try removing all the spls that are protecting
uvm.hashlock.
pmap_enter_pv() and pmap_remove_pv() use splvm() to protect the pv list.
this wouldn't be necessary if the mips pmap actually implemented
pmap_kenter_pa() the right way. I haven't seen your new pmap so I don't
know if you've already done something about it.
uvm_unlock_fpageq() uses splvm() to protect the uvm.fpageqlock,
which is actually necessary in the current implementation of the
page free lists since we allow allocation of pages from interrupts.
we could try to improve this but let's leave that for later.
dec_3max_init() is really dec_3max_intr(); I don't know if anything
can be improved there.
pmap_page_protect() uses splvm() to protect the pv list again,
and it's called from uvm_mk_pcluster() and uvmpd_scan_inactive().
in the test that I used for these measurements, each page should be
looked at twice after it's filled with data: once to write it to disk
in the async flush in ffs_write(), and again to free it in the pagedaemon.
these correspond to the two callers of pmap_page_protect(), in that order.
the rewrite of the pageout code which I'm about to start on will change the
pmap_page_protect() during a pageout into a pmap_clear_modify(), so that
will get rid of those calls. the pagedaemon will always need to remove
a page's mappings before it can safely free it, so that call will stay.
looking at the case in this particular test again, by the time the pagedaemon
considers a page for freeing, that page should already have no mappings
(since the only mapping was in pager_map, and that would already have been
recycled). the code in pmap_page_protect() is:
	case VM_PROT_READ:
	case VM_PROT_READ|VM_PROT_EXECUTE:
		pv = pa_to_pvh(pa);
		s = splvm();
		/*
		 * Loop over all current mappings setting/clearing as appropos.
		 */
		if (pv->pv_pmap != NULL) {
			for (; pv; pv = pv->pv_next) {
				va = pv->pv_va;
				/*
				 * XXX don't write protect pager mappings
				 */
				if (va >= uvm.pager_sva && va < uvm.pager_eva)
					continue;
				pmap_protect(pv->pv_pmap, va, va + PAGE_SIZE,
				    prot);
			}
		}
		splx(s);
		break;

	/* remove_all */
	default:
		pv = pa_to_pvh(pa);
		s = splvm();
		while (pv->pv_pmap != NULL) {
			pmap_remove(pv->pv_pmap, pv->pv_va,
			    pv->pv_va + PAGE_SIZE);
		}
		splx(s);
	}
in both of these cases, we block interrupts before we know whether there's
any work to do. we should change this to avoid calling spl* if the pv list
is empty.
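the shape of that change, sketched with simplified stand-in structures
(fake_splvm()/fake_splx() and the instrumentation counter are just for
illustration; the pv_entry layout here is a stand-in, not the real one):

```c
#include <assert.h>
#include <stddef.h>

struct pmap;				/* opaque for this sketch */

/* simplified stand-in for the real pv-list head */
struct pv_entry {
	struct pmap *pv_pmap;		/* NULL in the head: no mappings */
	struct pv_entry *pv_next;
};

static int spl_raises = 0;		/* counts how often we block interrupts */
static int fake_splvm(void) { spl_raises++; return 0; }
static void fake_splx(int s) { (void)s; }

/*
 * only raise spl once we know there is a mapping to operate on;
 * for the common case of an empty pv list we return immediately
 * without ever blocking interrupts.
 */
static void
page_protect_sketch(struct pv_entry *pv)
{
	int s;

	if (pv->pv_pmap == NULL)	/* empty pv list: nothing to do */
		return;
	s = fake_splvm();
	/* ... walk the pv list, pmap_protect()/pmap_remove() each entry ... */
	fake_splx(s);
}
```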
the next block of callers of _splset() are the buffer cache and malloc(),
which I'm not going to worry about right now. after that is
genfs_getpages(), which uses splbio() to protect bufpool.
it would be a lot better if the pool code didn't require the caller
to provide interrupt protection like this. we can think about this,
but it's not that much of a problem yet.
ok, back to the top level of the stack, the next-highest cpu pig is
genfs_getpages(), which I'll have to look at optimizing.
next is pmap_enter(), which (as I said above) is called about 2/3
of the time from uvm_pagermapin() and the other 1/3 of the time
from ubc_fault(). uvm_pagermapin() itself is pretty bogus,
since most hardware either does physically-addressed DMA or
else uses a separate iommu, so the only reason we have it
is because the device strategy interface uses a virtual address
to specify the memory it should operate on. we should change
block-based devices to operate on physical pages rather than
virtual addresses, which is a big project. this is also on my list
of things to fix eventually, but I wouldn't mind if someone
beat me to it.
next is ffs_blkpref(), which uses 50% more cpu in -current than
it did in 1.5 even though we call it the same number of times.
I'd guess this is TLB- and cache-thrashing, which is also on my list
of things to look at.
next is lockmgr(). 1/3 of the calls to lockmgr are to protect pager_map,
so those will go away if we change the device strategy interface as above.
another 1/3 of the calls are to protect the kernel_map around ubc_fault(),
which is pretty silly. I've put a new item on my list to look into that.
(actually I just got an idea of what to do about it, it shouldn't be hard
to get rid of those calls.) the other 1/3 protect the block map of the file,
which we need to keep.
next is uvm_fault(), which I think I can eliminate as well.
(the major usage of it in this test, not the function. :-) )
next is _splraise(), which has identical usage to _splset() above.
the same optimizations apply.
next is ncr53c9x_intr(), which I'll leave to someone else to look at.
next is uvm_map(), which is called almost exclusively by uvm_pagermapin()
in this test. see above.
next is ubc_init() which is really ubc_fault(), and that should go away
with the same change that eliminates uvm_fault().
next is pmap_remove(), which we mostly already talked about.
last is ubc_alloc(), which we need to keep, but I'll look into speeding it up.
so I think that about covers it. to summarize, the things we should do are:
1. avoid zeroing pages in genfs_getpages() [chuq]
2. avoid unnecessary calls to pmap_remove() in pmap_enter() [mips]
3. remove spl protection from uvm.hashlock [chuq]
4. implement pmap_kenter_pa() so that it doesn't use
   pv lists [mips]
5. avoid blocking interrupts in various pmap operations
when the pv list is empty [mips]
6. try to speed up genfs_getpages(), etc [chuq]
7. change device strategy interface to eliminate need for
pager_map in most cases [????/chuq]
8. reduce TLB and cache thrashing [chuq]
9. change the way UBC handles large writes to eliminate
a bunch of silly work we do now. [chuq]
so it appears that most of the work is mine, but it'd be nice if you
could take care of the mips-specific bits. hopefully these changes
for the write case will help the read case as well.
I'd also like to revert the change to profile.h that obscured the first
set of -current results; does anyone have any problems with that?
-Chuck
On Sat, Mar 17, 2001 at 11:11:16AM +0900, Toru Nishimura wrote:
> Hi, all.
>
> I've built -current kernel 1.5S with UBC usage balancing code and ran
> bonnie disk I/O test. The UBC kernel used to be simply unusable with
> such heavy I/O being done. My MAXINE got stuck waiting for
> something in uvn_xxx() _for ever_. The 1.5S kernel behaves differently
> and the system is usable this time.
>
> However, the performance figures disappointed me. With a 1995-vintage
> 1GB NEC SCSI drive, I got:
>
> 771579 bytes/second for writing the file
> 1625699 bytes/second for reading the file
>
> I well remember that write performance used to be 2 times better when
> the MI 53C9x SCSI driver was being debugged with this hardware. It is
> worse than my AHA1540B (on 1.3.3/i386) with the same HD model.
>
> Do people have affirmatives or no?
>
> Tohru Nishimura