Port-amd64 archive
Re: 70,000 TLB shootdown IPIs per second
On Wed, Dec 05, 2012 at 08:48:31AM -0500, Thor Lancelot Simon wrote:
> I have been doing some testing on a fileserver recently donated to TNF.
>
> The system has an Areca 1280 controller (arcmsr driver) with a single
> RAID6 volume configured on 12 disks; 32GB of RAM; two quad-core Xeon L5420
> CPUs.
>
> I have tested under NetBSD-6 and NetBSD-current as well as the tls-maxphys
> branch. The test is a 'dd bs=2m of=/test count=65536'. Write throughput,
> while acceptable (300-350MB/sec; 350-400MB/sec with tls-maxphys), is about
> 2/3 of what I get on the same hardware under Linux. Read throughput is
> worse, at 250-300MB/sec.
>
> The filesystem is FFSv2, 32k block/4k frag, with WAPBL.
>
> Watching systat while I do the dd tests, I see up to 70,000 TLB shootdown
> IPIs per second. Is this really right? I am not sure I know how to count
> these under Linux but I don't see any evidence of them. Is there a pmap
> problem? I'm running port-amd64.
>
> I have also seen some other odd things I'll detail elsewhere, but just for
> a start, can anyone explain to me whether this number of TLB shootdowns
> should be expected, whether each should really generate its own IPI, and
> what the performance impact may be?
time for a little fun with dtrace, now that it works on amd64:
#!/usr/sbin/dtrace -qs

/* aggregate calls to pmap_tlb_shootnow() by kernel stack trace */
fbt::pmap_tlb_shootnow:entry
{
	@a[stack()] = count();
}
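(a keyless variant of the same probe, plus a tick-1sec probe to print
and clear the aggregation once a second, gives a per-second rate you
can hold up against the number systat shows. this is an untested
sketch, and one shootnow call may not correspond to exactly one IPI,
so don't expect the two numbers to line up exactly:)

#!/usr/sbin/dtrace -qs

/* per-second count of pmap_tlb_shootnow() calls, to compare with systat */
fbt::pmap_tlb_shootnow:entry
{
	@rate = count();
}

profile:::tick-1sec
{
	printa("pmap_tlb_shootnow calls/sec: %@d\n", @rate);
	clear(@rate);
}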
and the top few entries from the stack aggregation, over a portion of your
dd test, are:
netbsd`pmap_deactivate+0x3b
netbsd`mi_switch+0x329
netbsd`idle_loop+0xe0
netbsd`0xffffffff80100817
4349

netbsd`pmap_deactivate+0x3b
netbsd`mi_switch+0x329
netbsd`kpreempt+0xe2
netbsd`0xffffffff80114295
netbsd`ubc_uiomove+0x113
netbsd`ffs_write+0x2c5
netbsd`VOP_WRITE+0x37
netbsd`vn_write+0xf9
netbsd`dofilewrite+0x7d
netbsd`sys_write+0x62
netbsd`syscall+0x94
netbsd`0xffffffff801006a1
6168

netbsd`pmap_deactivate+0x3b
netbsd`softint_dispatch+0x3a2
netbsd`0xffffffff8011422f
8028

netbsd`pmap_deactivate+0x3b
netbsd`mi_switch+0x329
netbsd`idle_loop+0xe0
netbsd`cpu_hatch+0x16b
netbsd`0xffffffff805cd345
18813

netbsd`pmap_deactivate+0x3b
netbsd`mi_switch+0x329
netbsd`sleepq_block+0xa4
netbsd`cv_wait+0x101
netbsd`workqueue_worker+0x4e
netbsd`0xffffffff80100817
20973

netbsd`pmap_update+0x3b
netbsd`uvm_pagermapout+0x29
netbsd`uvm_aio_aiodone+0x94
netbsd`workqueue_worker+0x7f
netbsd`0xffffffff80100817
20990

netbsd`pmap_update+0x3b
netbsd`uvm_unmap_remove+0x316
netbsd`uvm_pagermapout+0x69
netbsd`uvm_aio_aiodone+0x94
netbsd`workqueue_worker+0x7f
netbsd`0xffffffff80100817
20990

netbsd`pmap_update+0x3b
netbsd`uvm_pagermapin+0x18e
netbsd`genfs_gop_write+0x2f
netbsd`genfs_do_putpages+0xc74
netbsd`VOP_PUTPAGES+0x3a
netbsd`ffs_write+0x316
netbsd`VOP_WRITE+0x37
netbsd`vn_write+0xf9
netbsd`dofilewrite+0x7d
netbsd`sys_write+0x62
netbsd`syscall+0x94
netbsd`0xffffffff801006a1
85652

netbsd`pmap_update+0x3b
netbsd`ubc_alloc+0x514
netbsd`ubc_uiomove+0xe1
netbsd`ffs_write+0x2c5
netbsd`VOP_WRITE+0x37
netbsd`vn_write+0xf9
netbsd`dofilewrite+0x7d
netbsd`sys_write+0x62
netbsd`syscall+0x94
netbsd`0xffffffff801006a1
685211

netbsd`pmap_update+0x3b
netbsd`ubc_release+0x26a
netbsd`ubc_uiomove+0x113
netbsd`ffs_write+0x2c5
netbsd`VOP_WRITE+0x37
netbsd`vn_write+0xf9
netbsd`dofilewrite+0x7d
netbsd`sys_write+0x62
netbsd`syscall+0x94
netbsd`0xffffffff801006a1
685211
so this is working as currently designed, though obviously there's plenty of
room for improvement. other OSes have been moving to permanent large-page
mappings of RAM for accessing cached file data (linux has done that for
ages), and I've been wanting to do the same for us, but I haven't had the
time to embark on it. I did add support for the requisite "direct map"
stuff to amd64 just over a year ago, so at least that part is done.
the second-biggest source of IPIs in this workload is the current need to
map pages into the kernel's address space in order to send them to a disk
driver for I/O, even when the driver just ends up arranging for the pages
to be read or written by DMA. this is another thing I've wanted to improve
for a long time, but that's also a non-trivial project.
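if you want to put a number on what that path costs on your own machine,
the same fbt trick works; here's a rough, untested sketch that times the
two pager-map functions that show up in the stacks above:

#!/usr/sbin/dtrace -qs

/* wall-clock time spent mapping/unmapping pages for pager I/O */
fbt::uvm_pagermapin:entry,
fbt::uvm_pagermapout:entry
{
	self->ts = timestamp;
}

fbt::uvm_pagermapin:return,
fbt::uvm_pagermapout:return
/self->ts/
{
	@lat[probefunc] = quantize(timestamp - self->ts);
	self->ts = 0;
}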
as for a short-term workaround, some workloads will probably
do better with a larger setting of UBC_WINSHIFT, but that will
most likely hurt other workloads.
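(for anyone who wants to try that: UBC_WINSHIFT is a compile-time knob,
and the defaults for it and for UBC_NWINS live in sys/uvm/uvm_bio.c.
something like the following in a kernel config should do it, assuming
your tree accepts these as options; the values here are purely
illustrative, not a recommendation:)

options 	UBC_WINSHIFT=16		# window size as a shift, so 16 => 64kB windows
options 	UBC_NWINS=2048		# number of windows; more windows use more kernel VA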
-Chuck