Subject: Re: copyin/out
To: <>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 08/09/2002 11:18:13
> My three main concerns are:
>
> 1) how does it work on other ARM architectures
>
> 2) is the code too large for the more limited
> of the arm32 archs?
You need to do a system-wide benchmark for that. It all depends
on what you displace in order to include your unrolled loops.
I'm also not actually sure (and it is difficult to guess) whether
the code is likely to be in the cache when you start. If not,
you need to allow for the memory fetch times of the instructions.
This can mean that code loops are faster than table lookups.
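For scale, the opposite extreme is a four-instruction byte loop,
which is cheap to fetch even when the i-cache is cold. A minimal
sketch (register usage illustrative only, assumes r2 > 0 on entry):

1:	ldrbt	r3, [r0], #1	/* user-mode load, post-increment */
	subs	r2, r2, #1	/* count down */
	strb	r3, [r1], #1	/* store to the kernel buffer */
	bne	1b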
>
> 3) Are there large, unaligned data copies going
> through the copyin/copyout path?
>
> Basically, I've ditched the pte scan and I'm using ldr[b]t and str[b]t
> to access user data. I've also unrolled some loops and I've put in
> some code to prefetch with the 'pld' instruction on XScale
I got a significant benefit on SA1100 by doing a read-ahead of the
target address (to pull it into the data cache). I can't remember
the speeds I got (and no longer have a test system), but this was
the byte loop:
	ldrb	r4, [r0], #1		/* prime: fetch first src byte */
11:	subs	r2, r2, #1		/* one fewer byte to go */
	ldrneb	r5, [r1, #24]		/* read-ahead: pull the dest
					 * line into the data cache */
	strb	r4, [r1], #1		/* store the previous byte */
	ldrneb	r4, [r0], #1		/* fetch the next src byte */
	bne	11b
Only with the 'prefetch' did the order of the instructions matter.
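The same shape ought to carry over to a word loop; an untested
sketch (like the byte loop above this uses plain loads rather than
ldrt, and it assumes the count in r2 is a multiple of 4):

	ldr	r4, [r0], #4		/* prime: first src word */
11:	subs	r2, r2, #4
	ldrne	r5, [r1, #28]		/* touch dest a line ahead */
	str	r4, [r1], #4
	ldrne	r4, [r0], #4
	bne	11b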
> With this, I'm seeing copyout run at about 63MB/s on a simple test
> (dd if=/dev/zero of=/dev/null count=1024 bs=1024k).
How fast did it run before?
Some heavily snipped comments:
> /* Quick exit if length is zero */
> teq r2, #0
> moveq r0, #0
> moveq pc, lr
Does this help? Are there enough zero-length transfers for it to matter?
> * Align destination to word boundary.
> and r6, r1, #0x3
> ldr pc, [pc, r6, lsl #2]
> b Lialend
> .word Lialend
> .word Lial1
> .word Lial2
> .word Lial3
> Lial3: ldrbt r6, [r0], #1
> sub r2, r2, #1
> strb r6, [r1], #1
> Lial2: ldrbt r7, [r0], #1
> sub r2, r2, #1
> strb r7, [r1], #1
> Lial1: ldrbt r6, [r0], #1
> sub r2, r2, #1
> strb r6, [r1], #1
> Lialend:
How about:
	ands	r6, r1, #3		/* r6 = dest misalignment, 0-3 */
	addne	pc, pc, r6, lsl #3	/* jump into the pairs below */
	b	Lialend			/* aligned: nothing to do */
	nop				/* pad so r6=1 lands on the */
	nop				/* first of the three pairs */
	ldrbt	r7, [r0], #1		/* entry for r6=1: 3 bytes */
	strb	r7, [r1], #1
	ldrbt	r7, [r0], #1		/* entry for r6=2: 2 bytes */
	strb	r7, [r1], #1
	ldrbt	r7, [r0], #1		/* entry for r6=3: 1 byte */
	strb	r7, [r1], #1
	eor	r6, r6, #2		/* 1->3, 2->2, 3->1: bytes done */
	sub	r2, r2, r6		/* adjust the count just once */
Lialend:
Which, in particular, saves pulling a chunk of the code area (the
jump table) into the data cache.
> /* If few bytes left, finish slow. */
> cmp r2, #0x08
> blt Licleanup
Surely it is worth increasing the minimum number of bytes required
to enter this path, so that this check can never be taken.
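E.g. bump the threshold at the top of the function; a sketch (the
exact constant is a guess, the 3 covering worst-case destination
alignment):

	cmp	r2, #0x0b	/* 0x08 + 3 worst-case align bytes */
	blt	Licleanup	/* short copy: byte loop does the lot */

after which the 'cmp r2, #0x08; blt Licleanup' above can go.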
> /* If source is not aligned, finish slow. */
> ands r3, r0, #0x03
> bne Licleanup
Maybe it is worth checking earlier that src and dest have the same
alignment.
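A sketch of such a check (Licleanup as the slow path, register
choice arbitrary):

	eor	r3, r0, r1	/* compare the low address bits */
	tst	r3, #3
	bne	Licleanup	/* mutually misaligned: stay byte-wise */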
>
> /*
> * Align destination to cacheline boundary.
> * If source and destination are nicely aligned, this can be a big
> * win. If not, it's still cheaper to copy in groups of 32 even if
> * we don't get the nice cacheline alignment.
> */
Do you always want to align on the destination?
For SA1100, if you can do 'stm' writes of 4 words then you don't
need to worry about the destination being cached (unless you want
the data soon). Aligning the source might also be a win, because
you could then use ldm after an initial ldrt (saving cache).
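To illustrate the stm point for copyout (kernel source, so ldm is
fine), a sketch of an inner loop; register saving and the user
access checking that strt would give you are ignored here:

	subs	r2, r2, #16	/* reserve one 4-word chunk */
	blt	9f
1:	ldmia	r0!, {r4-r7}	/* 16 bytes from the kernel src */
	stmia	r1!, {r4-r7}	/* one write-buffer-friendly burst */
	subs	r2, r2, #16
	bge	1b
9:	add	r2, r2, #16	/* restore the residual count */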
Also, as with the byte-align code, it ought to be possible to
avoid the jump-table data read and the 'sub r2,r2,#2' in each case.
> * This loop basically works out to:
> * do {
> * prefetch-next-cacheline(s)
> * bytes -= 0x20;
> * copy cacheline
> * } while (bytes >= 0x40);
> * bytes -= 0x20;
> * copy cacheline
I'd try to reduce the code size by only having the 'copy
cacheline' present once. Shouldn't be too hard!
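Something like this shape, with the line copy present once and the
prefetch simply skipped on the last pass (a sketch; note pld cannot
be made conditional, hence the branch, and register saving and the
user-access checks are again ignored):

1:	cmp	r2, #0x40	/* will another full line follow? */
	blt	2f
	pld	[r0, #0x20]	/* XScale: prefetch the next src line */
2:	ldmia	r0!, {r4-r11}	/* copy one 32-byte line */
	stmia	r1!, {r4-r11}
	sub	r2, r2, #0x20
	cmp	r2, #0x20
	bge	1b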
> Licleanup:
> and r6, r2, #0x3
> ldr pc, [pc, r6, lsl #2]
> b Licend
> .word Lic4
> .word Lic1
> .word Lic2
> .word Lic3
> Lic4: ldrbt r6, [r0], #1
> sub r2, r2, #1
> strb r6, [r1], #1
> Lic3: ldrbt r7, [r0], #1
> sub r2, r2, #1
> strb r7, [r1], #1
> Lic2: ldrbt r6, [r0], #1
> sub r2, r2, #1
> strb r6, [r1], #1
> Lic1: ldrbt r7, [r0], #1
> subs r2, r2, #1
> strb r7, [r1], #1
> Licend:
> bne Licleanup
A truly horrid loop!
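For comparison, a conditional byte loop does the same job in four
instructions (a sketch in the same old syntax, reusing r6):

Licleanup:
	subs	r2, r2, #1	/* another byte to do? */
	ldrgebt	r6, [r0], #1	/* if so, copy it... */
	strgeb	r6, [r1], #1
	bgt	Licleanup	/* ...and maybe more after that */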
David
--
David Laight: david@l8s.co.uk