Subject: Re: copyin/out
To: David Laight <david@l8s.co.uk>
From: Allen Briggs <briggs@wasabisystems.com>
List: port-arm
Date: 08/09/2002 16:54:54
On Fri, Aug 09, 2002 at 11:18:13AM +0100, David Laight wrote:
> You need to do a system-wide benchmark for that. It all depends
> on what you displace in order to include your unrolled loops.
I'm actually interested in what people observe on different ARM
architectures, hence my post to port-arm. If this code isn't
better for some, then we need to do something different. If it
is better, then I think we want to use it.
> > Basically, I've ditched the pte scan and I'm using ldr[b]t and str[b]t
> > to access user data. I've also unrolled some loops and I've put in
> > some code to prefetch with the 'pld' instruction on XScale
>
> I got a significant benefit on SA1100 by doing a read ahead of the target
> address (to pull it into the data cache). I can't remember the speeds
> I got (and no longer have a test system). But this is the byte loop:
>         ldrb    r4, [r0], #1
> 11:     subs    r2, r2, #1
>         ldrneb  r5, [r1, #24]
>         strb    r4, [r1], #1
>         ldrneb  r4, [r0], #1
>         bne     11b
> Only with the 'prefetch' did the order of the instructions matter.
That makes sense since your stalls were otherwise lost in the noise of
a cache miss-write-through cycle. The 'prefetch' pulled the target
into the cache so you didn't have a cache miss and so the store went
to the cache instead of to memory.
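For anyone who wants to try the read-ahead idea outside the kernel, here is a rough C sketch of it. It is not the code from either routine: it assumes GCC or Clang (for __builtin_prefetch), and the name copy_bytes_prefetch and the PREFETCH_AHEAD constant (taken from the #24 offset in the loop above) are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not from the NetBSD tree. The distance mirrors
 * the #24 offset used in the quoted byte loop. */
#define PREFETCH_AHEAD 24

/* Byte copy that hints the destination into the data cache a fixed
 * distance ahead of the store, so the store hits the cache. */
void *copy_bytes_prefetch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (; len > 0; len--) {
        /* Only hint while the target is still inside the buffer.
         * Second argument 1 means "prefetch for write". */
        if (len > PREFETCH_AHEAD)
            __builtin_prefetch(d + PREFETCH_AHEAD, 1);
        *d++ = *s++;
    }
    return dst;
}
```

Unlike the assembly loop, this uses a prefetch hint rather than a real load, so whether it shows the same benefit on a given core is something to measure, not assume.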
> > With this, I'm seeing copyout run at about 63MB/s on a simple test
> > (dd if=/dev/zero of=/dev/null count=1024 bs=1024k).
> How fast did it run before?
Depends on the cache mode. With it using standard write-back cache
(like on the SA-110), it was running closer to 40MB/s. With the
write-allocate cacheline allocation policy, this was only slightly
better.
> Does this help? Are there enough 0 length transfers for it to matter?
As Jason said, we're going to be profiling this.
> > * Align destination to word boundary.
> >         and     r6, r1, #0x3
> >         ldr     pc, [pc, r6, lsl #2]
> >         b       Lialend
> >         .word   Lialend
> >         .word   Lial1
> >         .word   Lial2
> >         .word   Lial3
> > Lial3:  ldrbt   r6, [r0], #1
> >         sub     r2, r2, #1
> >         strb    r6, [r1], #1
> > Lial2:  ldrbt   r7, [r0], #1
> >         sub     r2, r2, #1
> >         strb    r7, [r1], #1
> > Lial1:  ldrbt   r6, [r0], #1
> >         sub     r2, r2, #1
> >         strb    r6, [r1], #1
> > Lialend:
>
> How about:
>         ands    r6, r1, #3
>         addne   pc, pc, r6, lsl #3
>         b       Lialend
>         nop
>         nop
>         ldrbt   r7, [r0], #1
>         strb    r7, [r1], #1
>         ldrbt   r7, [r0], #1
>         strb    r7, [r1], #1
>         ldrbt   r7, [r0], #1
>         strb    r7, [r1], #1
>         eor     r6, r6, #2
>         sub     r2, r2, r6
> Lialend:
I'm not sure I've convinced myself that that's the same thing. Also,
you'll have data dep stalls just using r7 there. Have you tested this?
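One way to check whether the two sequences agree is to model them on paper, or in a few lines of C. The sketch below is only a model of the listings as quoted, not a test of real code. It assumes ARM-mode semantics, where reading pc yields the instruction's address plus 8. The struct and function names are invented for the purpose.

```c
#include <stdio.h>

/* Outcome of one alignment prologue. */
struct align_result {
    unsigned copied;      /* bytes moved by the ldrbt/strb pairs */
    unsigned subtracted;  /* amount taken off the length in r2   */
};

/* Jump-table version as quoted: table entry r6 selects Lialend, Lial1,
 * Lial2 or Lial3, and each label falls through to the next, doing one
 * copy and one "sub r2, r2, #1" per step. */
struct align_result model_table(unsigned dest)
{
    unsigned r6 = dest & 3;
    struct align_result r = { r6, r6 };
    return r;
}

/* Proposed version: "addne pc, pc, r6, lsl #3" lands r6 * 8 bytes past
 * the first nop, which skips the two nops plus (r6 - 1) copy pairs;
 * then "eor r6, r6, #2" and "sub r2, r2, r6". */
struct align_result model_addne(unsigned dest)
{
    unsigned r6 = dest & 3;
    struct align_result r = { 0, 0 };

    if (r6 != 0) {                /* addne taken; else "b Lialend" */
        r.copied = 3 - (r6 - 1);
        r.subtracted = r6 ^ 2;
    }
    return r;
}

/* Print both models next to the byte count needed to reach a word
 * boundary for each destination alignment. */
void print_alignment_table(void)
{
    for (unsigned a = 0; a < 4; a++) {
        struct align_result t = model_table(a), p = model_addne(a);
        printf("dest&3=%u need=%u  table: copied=%u sub=%u  "
               "addne: copied=%u sub=%u\n",
               a, (4 - a) & 3, t.copied, t.subtracted,
               p.copied, p.subtracted);
    }
}
```

If the model is right, the two sequences differ when dest & 3 is 1 or 3 (the table as quoted copies r6 bytes, the addne version copies 4 - r6), and the eor yields 0 rather than 2 when dest & 3 is 2. Both are worth checking against the real code before trusting either listing.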
> > /* If few bytes left, finish slow. */
> > cmp r2, #0x08
> > blt Licleanup
>
> Surely it is worth increasing the number of bytes we enter this
> path with to ensure that branch is never taken.
Possibly. Although I rather suspect that we often call this code
with 4 or 8 bytes (granted, probably aligned). This gets back to the
histograms... :-)
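For the histograms, something as simple as power-of-two buckets would probably do. This is a sketch only; the names copy_hist, copy_hist_bucket and copy_hist_record are hypothetical and nothing like them is in the posted code.

```c
#include <stddef.h>

/* Bucket 0 holds len == 0; bucket k (1..14) holds
 * 2^(k-1) <= len < 2^k; bucket 15 holds everything larger. */
#define NBUCKETS 16

/* Hypothetical counters, one per bucket. */
static unsigned long copy_hist[NBUCKETS];

/* Map a transfer length to its power-of-two bucket. */
unsigned copy_hist_bucket(size_t len)
{
    unsigned b = 0;

    while (len != 0 && b < NBUCKETS - 1) {
        len >>= 1;
        b++;
    }
    return b;
}

/* Would be called once per copyin/copyout with the requested length. */
void copy_hist_record(size_t len)
{
    copy_hist[copy_hist_bucket(len)]++;
}
```

With this layout, everything that would take the "cmp r2, #0x08 / blt Licleanup" path lands in buckets 0 through 3, and 8-byte copies land in bucket 4, so the counts would show directly how often the short path matters.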
> > /* If source is not aligned, finish slow. */
> > ands r3, r0, #0x03
> > bne Licleanup
> Maybe worth checking that src and dest have the same alignment earlier.
I did this at first, but this increased the number of branches for
some paths in the code.
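For reference, the check itself is cheap to express. A C sketch follows, with a made-up helper name. In the assembly it comes to an eor, a tst and a conditional branch, and that branch is the cost mentioned above.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when src and dst share the same offset
 * within a 4-byte word, so aligning one pointer also aligns the other. */
int same_word_alignment(const void *src, const void *dst)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 0x3) == 0;
}
```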
> Do you always want to align on the destination?
I'm not sure.
> For SA1100, if you can do 'stm' writes of 4 words then you don't
> need to worry about the destination being cached (unless you want
> the data soon).
That's interesting. Can you elaborate some on this? Or point me to
a specific location in the manual?
> Also aligning the source might be a win on
> because you could use ldm after an initial ldrt (saving cache).
Cache & code size.
> Also, like the byte align code, it ought to be possible to
> avoid the data read and the 'sub r2,r2,#2' in each case.
The sub is really cheap because you'd be stuck in a data stall there
anyway, I believe.
> I'd try to reduce the code size by only having the 'copy
> cacheline' present once. Shouldn't be too hard!
This is the classic size/speed tradeoff, I believe. Either
we have another branch or we may prefetch data we will never
need.
> A truly horrid loop!
Thanks, I think. ;-)
-allen
--
Allen Briggs briggs@wasabisystems.com
http://www.wasabisystems.com/ Quality NetBSD CDs, Sales, Support, Service
NetBSD development for Alpha, ARM, M68K, MIPS, PowerPC, SuperH, XScale, etc...