tech-kern archive


Re: notes from running will-it-scale



There is:

Built-in Function: void * __builtin_assume_aligned (const void *exp,
size_t align, ...)

but I don't know how well it works in practice. Note that both the
direct map base and the address should be aligned.
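
A rough userland sketch of what I mean, not tested against the NetBSD
tree. direct_base and zero_page() are made-up stand-ins for
pmap_direct_base and pmap_zero_page(), and PAGE_SIZE is assumed to be
4096. The hint is applied to the final sum, and since a false alignment
promise is undefined behaviour it is checked with an assert first:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u		/* assumed page size (amd64) */

/* Hypothetical stand-in for pmap_direct_base; must be page-aligned. */
static uintptr_t direct_base;

/* Hypothetical stand-in for pmap_zero_page(); pa must be page-aligned. */
static void
zero_page(uintptr_t pa)
{
	uintptr_t va = direct_base + pa;

	/* Lying to the compiler about alignment is undefined behaviour. */
	assert((va & (PAGE_SIZE - 1)) == 0);

	/* Promise the compiler that the pointer is page-aligned. */
	void *p = __builtin_assume_aligned((void *)va, PAGE_SIZE);

	memset(p, 0, PAGE_SIZE);
}
```

Whether this actually gets rid of the alignment branches depends on the
compiler version and flags, so the generated code still has to be
checked by disassembling it.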

However, an argument for dedicated routines is that on CPUs with ERMS
it is faster to rep stosb/movsb than stosq/movsq (and the other way
around without said extension). If rep-using memset is inlined you are
stuck with stosq by default.
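
To illustrate the dispatch idea, here is a userland sketch for x86-64
with GCC or Clang. It is not the FreeBSD or NetBSD code: cpu_has_erms(),
pagezero_init() and page_zero() are names made up for the example, and a
plain function pointer stands in for an ifunc. ERMS is reported in
CPUID leaf 7, subleaf 0, EBX bit 9:

```c
#include <assert.h>
#include <cpuid.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u		/* assumed page size (amd64) */

/* Zero one page with 8-byte stores; the choice without ERMS. */
static void
pagezero_std(void *va)
{
	size_t n = PAGE_SIZE / 8;

	__asm__ volatile("rep stosq"
	    : "+D"(va), "+c"(n)
	    : "a"((uint64_t)0)
	    : "memory");
}

/* Zero one page with byte stores; the choice with ERMS. */
static void
pagezero_erms(void *va)
{
	size_t n = PAGE_SIZE;

	__asm__ volatile("rep stosb"
	    : "+D"(va), "+c"(n)
	    : "a"(0)
	    : "memory");
}

/* Returns 1 if CPUID.(EAX=7,ECX=0):EBX bit 9 (ERMS) is set. */
static int
cpu_has_erms(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_max(0, NULL) < 7)
		return 0;
	__cpuid_count(7, 0, eax, ebx, ecx, edx);
	return (ebx >> 9) & 1;
}

/* Selected once at startup; a stand-in for ifunc resolution. */
static void (*pagezero_fn)(void *) = pagezero_std;

static void
pagezero_init(void)
{
	pagezero_fn = cpu_has_erms() ? pagezero_erms : pagezero_std;
}

static void
page_zero(void *va)
{
	assert(((uintptr_t)va & (PAGE_SIZE - 1)) == 0);
	pagezero_fn(va);
}
```

Both variants produce the same result on any x86-64 CPU, the only
difference is speed, so picking the wrong one is a performance problem
and not a correctness one.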

On 7/19/20, Jaromír Doleček <jaromir.dolecek%gmail.com@localhost> wrote:
> Very interesting, particularly the outrageous assembly for
> pmap_{zero,copy}_page().
>
> Is there some way to tell the compiler that the address is already
> 4096-aligned and avoid the conditionals? Failing that, we could just
> adopt the FreeBSD assembly for this.
>
> Does anyone see a problem with introducing a vfs.timestamp_precision
> to avoid the rdtscp?
>
> Jaromir
>
> On Sun, 19 Jul 2020 at 13:21, Mateusz Guzik <mjguzik%gmail.com@localhost> wrote:
>>
>> Hello,
>>
>> I recently took an opportunity to run cross-systems microbenchmarks
>> with will-it-scale and included NetBSD (amd64).
>>
>> https://people.freebsd.org/~mjg/freebsd-dragonflybsd-netbsd-v2.txt
>> [no linux in this doc, I will probably create a new one soon(tm)]
>>
>> The system has a lot of problems in the vfs layer, vm is a mixed bag
>> with multithreaded cases lagging behind and some singlethreaded being
>> pretty good (and at least one winning against the other systems).
>>
>> Notes:
>> - rdtscp is very expensive in vms, yet the kernel unconditionally
>> executes it by calling vfs_timestamp. Both FreeBSD and DragonflyBSD
>> have a knob to change the resolution (and consequently avoid the
>> instruction), I think you should introduce it and default to less
>> accuracy on vms. Sample results:
>> stock pipe1: 2413901
>> patched pipe1: 3147312
>> stock vfsmix: 13889
>> patched vfsmix: 73477
>> - sched_yield is apparently a nop when the binary is not linked with
>> pthread. This does not match other systems and is probably a bug.
>> - pmap_zero_page/pmap_copy_page compile to atrocious code which keeps
>> checking for alignment. The compiler does not know what values can be
>> assigned to pmap_direct_base and improvises.
>>
>>    0xffffffff805200c3 <+0>:       add    0xf93b46(%rip),%rdi        # 0xffffffff814b3c10 <pmap_direct_base>
>>    0xffffffff805200ca <+7>:       mov    $0x1000,%edx
>>    0xffffffff805200cf <+12>:      xor    %eax,%eax
>>    0xffffffff805200d1 <+14>:      test   $0x1,%dil
>>    0xffffffff805200d5 <+18>:      jne    0xffffffff805200ff <pmap_zero_page+60>
>>    0xffffffff805200d7 <+20>:      test   $0x2,%dil
>>    0xffffffff805200db <+24>:      jne    0xffffffff8052010b <pmap_zero_page+72>
>>    0xffffffff805200dd <+26>:      test   $0x4,%dil
>>    0xffffffff805200e1 <+30>:      jne    0xffffffff80520116 <pmap_zero_page+83>
>>    0xffffffff805200e3 <+32>:      mov    %edx,%ecx
>>    0xffffffff805200e5 <+34>:      shr    $0x3,%ecx
>>    0xffffffff805200e8 <+37>:      rep stos %rax,%es:(%rdi)
>>    0xffffffff805200eb <+40>:      test   $0x4,%dl
>>    0xffffffff805200ee <+43>:      je     0xffffffff805200f1 <pmap_zero_page+46>
>>    0xffffffff805200f0 <+45>:      stos   %eax,%es:(%rdi)
>>    0xffffffff805200f1 <+46>:      test   $0x2,%dl
>>    0xffffffff805200f4 <+49>:      je     0xffffffff805200f8 <pmap_zero_page+53>
>>    0xffffffff805200f6 <+51>:      stos   %ax,%es:(%rdi)
>>    0xffffffff805200f8 <+53>:      and    $0x1,%edx
>>    0xffffffff805200fb <+56>:      je     0xffffffff805200fe <pmap_zero_page+59>
>>    0xffffffff805200fd <+58>:      stos   %al,%es:(%rdi)
>>    0xffffffff805200fe <+59>:      retq
>>    0xffffffff805200ff <+60>:      stos   %al,%es:(%rdi)
>>    0xffffffff80520100 <+61>:      mov    $0xfff,%edx
>>    0xffffffff80520105 <+66>:      test   $0x2,%dil
>>    0xffffffff80520109 <+70>:      je     0xffffffff805200dd <pmap_zero_page+26>
>>    0xffffffff8052010b <+72>:      stos   %ax,%es:(%rdi)
>>    0xffffffff8052010d <+74>:      sub    $0x2,%edx
>>    0xffffffff80520110 <+77>:      test   $0x4,%dil
>>    0xffffffff80520114 <+81>:      je     0xffffffff805200e3 <pmap_zero_page+32>
>>    0xffffffff80520116 <+83>:      stos   %eax,%es:(%rdi)
>>    0xffffffff80520117 <+84>:      sub    $0x4,%edx
>>    0xffffffff8052011a <+87>:      jmp    0xffffffff805200e3 <pmap_zero_page+32>
>>
>> The thing to do in my opinion is to just provide dedicated asm funcs.
>> This is the equivalent on FreeBSD (ifunc'ed):
>>
>> ENTRY(pagezero_std)
>>         PUSH_FRAME_POINTER
>>         movl    $PAGE_SIZE/8,%ecx
>>         xorl    %eax,%eax
>>         rep
>>         stosq
>>         POP_FRAME_POINTER
>>         ret
>> END(pagezero_std)
>>
>> ENTRY(pagezero_erms)
>>         PUSH_FRAME_POINTER
>>         movl    $PAGE_SIZE,%ecx
>>         xorl    %eax,%eax
>>         rep
>>         stosb
>>         POP_FRAME_POINTER
>>         ret
>> END(pagezero_erms)
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>>
>


-- 
Mateusz Guzik <mjguzik gmail.com>


