Subject: Re: bcopy optimisation
To: None <port-arm32@NetBSD.ORG>
From: Olly Betts <olly@MANTIS.CO.UK>
List: port-arm32
Date: 07/09/1996 00:02:46
"Mark Brinicombe" writes:
>[fast bcopy needed]
>In addition to making it fast typically using the LDM and STM instructions
>consideration needs to be given to the sizes being copied. Logging statistics
>for the bcopy routine shows that it is regularly called for certain sizes
>of copy far more frequently than others.
>The most common sizes are 12, 8, 128, 6, 4, 16, 2 in that order.
>This may mean that the best performance will be gained if these sizes are
>spotted and specially coded.
OK, here's a first attempt. I've gone for the "source and destination 4-byte
aligned, size multiple of 4 bytes" case, which probably covers most of the
common ones Mark lists. This doesn't handle overlapping blocks (i.e. it's
memcpy, not memmove). Mark asked for an "overlapping memcpy" -- does this
mean memmove is actually required?
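
For anyone who'd rather read the entry test in C first, here is a minimal sketch of the check the fast path depends on, together with the overlap test that separates memcpy from memmove. The names is_fast_case and regions_overlap are made up for illustration and are not part of any existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: non-zero when src, dest and len are all
 * multiples of 4. OR-ing the three values together lets a single
 * mask test check all of them at once. */
static int is_fast_case(const void *src, const void *dest, size_t len)
{
    uintptr_t bits = (uintptr_t)src | (uintptr_t)dest | (uintptr_t)len;
    return (bits & 3u) == 0;
}

/* Hypothetical helper: non-zero when the two len-byte blocks share
 * at least one byte, i.e. when memmove semantics would be needed. */
static int regions_overlap(const void *src, const void *dest, size_t len)
{
    uintptr_t s = (uintptr_t)src, d = (uintptr_t)dest;
    return s < d ? d - s < len : s - d < len;
}
```

Of the sizes Mark lists, 12, 8, 128, 4 and 16 pass the length part of this test; 6 and 2 do not and would go to the general routine.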
fast_memcpy
; In:  R0 -> src, R1 -> dest, R2 = length
; Out: R0 preserved (R1,R2,R3,ip corrupted as APCS allows)
; Are src and dest word-aligned, and are we copying a multiple of 4 bytes?
        ORR     R3,R0,R1
        ORR     R3,R3,R2
        TST     R3,#3
        BNE     memcpy  ; whatever is currently used as memcpy
;
; OK, we're ready to rock'n'roll...
; Use ip as R0 needs to be unchanged on exit
        MOV     ip,R0
|_alignedwordcpy|
|_alignedwordcpylp3|
        SUBS    R2,R2,#4
        LDRGE   R3,[ip],#4
        STRGE   R3,[R1],#4
; to unroll this loop, repeat these 3 instructions
        SUBGES  R2,R2,#4
        LDRGE   R3,[ip],#4
        STRGE   R3,[R1],#4
;
        BGT     |_alignedwordcpylp3|  ; GT, not NE: R2 is -4 here after an odd word count
        MOVS    PC,R14
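
As a cross-check on the loop logic, here is a rough C model of how the two-way unrolled loop is meant to behave. It is only a sketch: the function name aligned_word_copy is hypothetical, and it assumes the entry test has already passed (both pointers word-aligned, length a multiple of 4) and that the blocks don't overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the aligned loop. Assumes src and dest are
 * word-aligned, len is a multiple of 4, and the blocks don't overlap. */
static void aligned_word_copy(const void *src, void *dest, size_t len)
{
    const uint32_t *s = src;
    uint32_t *d = dest;
    ptrdiff_t left = (ptrdiff_t)len;  /* signed, like the conditional tests */

    do {
        left -= 4;                    /* SUBS   R2,R2,#4 */
        if (left >= 0) {
            *d++ = *s++;              /* LDRGE / STRGE   */
            left -= 4;                /* SUBGES R2,R2,#4 */
            if (left >= 0)
                *d++ = *s++;          /* LDRGE / STRGE   */
        }
    } while (left > 0);               /* go round again while bytes remain */
}
```

The counter has to be compared as a signed value: after an odd number of words it ends at -4 rather than 0, so the loop must stop on "not greater than zero" rather than on "equal to zero".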
I've tested this under RISC OS on an ARM610 Risc PC and it is 25% faster than
the Shared C Library on a selection of small aligned blocks with sizes which
are multiples of 4. I haven't had time to install RiscBSD yet :(
BTW, a quick play at unrolling suggested the code as I've given it is
a good trade-off.
Olly