Subject: Re: lib/35535: memcpy() is very slow if not aligned
To: None <port-amd64-maintainer@netbsd.org, gnats-admin@netbsd.org>
From: Kimura Fuyuki <fuyuki@hadaly.org>
List: netbsd-bugs
Date: 02/03/2007 14:25:02
The following reply was made to PR port-amd64/35535; it has been noted by GNATS.
From: Kimura Fuyuki <fuyuki@hadaly.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: lib/35535: memcpy() is very slow if not aligned
Date: Sat, 3 Feb 2007 23:24:24 +0900
On Saturday 03 February 2007, David Laight wrote:
>
> 1) I'm not sure that optimisations for > 128k copies are necessarily
> worthwhile. Code ought to be passing such data by reference!
> In the kernel, the only common large copy is (ought to be) the
> copy-on-write of shared pages.
For kernel use, it's true that the code for >128k copies is not that useful. I
included it only because the library is shared between the kernel and
userland. If you think the optimization for larger buffers is not a good
idea, it could be removed or #ifdef'ed out for the kernel.
> 2) You want to look at the costs for short copies. They are much more
> common than you think.
> I've not done any timings for 'rep movsx', but I did do some for
> 'rep stosx' a couple of years ago. The instruction setup costs on
> modern CPUs are significant, so they shouldn't be used for small loops.
> A common non-optimisation is the use of a 'rep movsb' instruction to
> move the remaining bytes - which is likely to be zero [1].
> One option is to copy the last 4/8 bytes first!
> I also discovered that the pentium IV needs the target address to be
> 8 byte aligned!
Fact 1: I misunderstood gcc's optimization policy a bit; I had thought
memcpy() calls were inlined or unrolled into mov instructions even more
aggressively. So yes, short copies are important, as you say. But they *are*
properly inlined in many cases. From gcc(1):
-mmemcpy
-mno-memcpy
Force (do not force) the use of "memcpy()" for non-trivial block
moves. The default is -mno-memcpy, which allows GCC to inline most
constant-sized copies.
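For illustration, a copy with a compile-time-constant size, like the one
below, is normally turned into a few mov instructions by gcc, so the library
memcpy() never runs for it. (This is a hypothetical example, not code from
the patch.)

#include <string.h>

struct point { int x, y; };

void
copy_point(struct point *dst, const struct point *src)
{
	/* sizeof(*src) is a compile-time constant, so gcc will normally
	 * inline this call instead of calling the library memcpy(). */
	memcpy(dst, src, sizeof(*src));
}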
Fact 2: I think a single extra branch is not much of a burden for modern
CPUs. Real numbers follow. (OK, it could be a small burden...)
plain:
$ time ./memcpy_bench 64 100000000 0 0
dst:0x502080 src:0x5020c0 len:64
./memcpy_bench 64 100000000 0 0 3.36s user 0.00s system 99% cpu 3.390 total
patched:
$ time ./memcpy_bench 64 100000000 0 0
dst:0x502080 src:0x5020c0 len:64
./memcpy_bench 64 100000000 0 0 3.49s user 0.00s system 99% cpu 3.517 total
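For context, the inner loop of the benchmark is essentially a repeated
memcpy() of a fixed-size buffer. The sketch below is only my rough
reconstruction of memcpy_bench; the argument order (length, iterations,
dst offset, src offset) is a guess from the command lines above, and the
buffer setup is illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char dstbuf[1024 * 1024 + 64];
static char srcbuf[1024 * 1024 + 64];

int
main(int argc, char **argv)
{
	size_t len = (argc > 1) ? strtoul(argv[1], NULL, 0) : 64;
	long iters = (argc > 2) ? strtol(argv[2], NULL, 0) : 1;
	size_t doff = (argc > 3) ? strtoul(argv[3], NULL, 0) : 0;
	size_t soff = (argc > 4) ? strtoul(argv[4], NULL, 0) : 0;
	char *dst = dstbuf + doff;
	char *src = srcbuf + soff;
	long i;

	if (len + doff > sizeof(dstbuf) || len + soff > sizeof(srcbuf))
		return 1;

	printf("dst:%p src:%p len:%zu\n", (void *)dst, (void *)src, len);

	for (i = 0; i < iters; i++) {
		memcpy(dst, src, len);
		/* keep gcc from optimizing the copies away */
		__asm__ volatile("" ::: "memory");
	}
	return 0;
}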
Fact 3: I didn't touch the rep part of the code; I kept the patch as small
as I could. I agree that the rep prefix should be used carefully.
> 3) (2) may well apply to the use to movsb to align copies.
Actually, I tried three versions of the alignment code, including a
movsb-less one, and took the one that was simpler and faster. Anyway, there
was no big difference among the three. Note also that memcpy's destination
address is very likely to be aligned already.
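In C-like terms, the destination-alignment approach looks roughly like the
sketch below. The actual patch is amd64 assembly; the function name and the
word-copy loop here are only illustrative.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *
copy_align_dst(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head = (8 - ((uintptr_t)d & 7)) & 7;

	if (head > len)
		head = len;
	len -= head;

	/* Align the destination first; this usually runs zero times
	 * because destinations tend to be aligned already. */
	while (head--)
		*d++ = *s++;

	/* Bulk copy to the now-aligned destination; the source may still
	 * be misaligned, which is the slow case this PR is about. The
	 * real code would use 8-byte moves or 'rep movsq' here. */
	while (len >= 8) {
		uint64_t w;
		memcpy(&w, s, 8);	/* possibly unaligned load */
		memcpy(d, &w, 8);	/* aligned store */
		d += 8;
		s += 8;
		len -= 8;
	}
	while (len--)
		*d++ = *s++;

	return dst;
}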
The real (what's real, anyway?) latencies of the rep instructions can be seen
here (section 8.3):
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
Thanks for your comment.
> [1] Certain compilers convert:
> while (a < b)
> *a++ = ' ';
> into the inlined version of memset(), including 2 'expensive to setup'
> 'rep stosx' instructions, when I explicitly wrote the loop because the
> loop count is short....
gcc 4 is a little bit smarter than that, I think. :)
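For reference, a stand-alone version of the loop from the footnote; the
point is that some compilers expand even this into an inlined memset() with
'rep stos' setup, although the typical count is small:

void
pad_with_spaces(char *a, char *b)
{
	while (a < b)
		*a++ = ' ';
}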