Subject: Re: Xscale optimisations
To: None <Richard.Earnshaw@arm.com>
From: David Laight <david@l8s.co.uk>
List: port-arm
Date: 10/14/2003 14:35:58
> StrongARM load/store multiple instructions are expanded in the pipeline
> into a sequence of equivalent load/store word operations (which is why
> they take a long time to *not* execute if the condition fails). A
> sequence of stores that miss the cache will go direct to the write buffer.
> Provided that write-coalescing is enabled, this will be used to form a
> burst on the memory bus.
Mmmm, IIRC we only ever saw bursts on the memory bus for cache line writes.
(Although it wasn't me driving the analyser that day.)
I know I got a faster memcpy (on the SA-1100) by fetching the target buffer
into the data cache (an lda offset by a magic number would do the trick;
it didn't stall, since the target data was never used!)
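
Roughly the idea, as a C sketch rather than the assembler I actually used.
The function name, the 32-byte step and the PREFETCH_AHEAD value are all
made up for illustration; the real magic number has to be found by
measurement on the target:

```c
#include <stddef.h>

/* Hypothetical tuning values, not taken from any real implementation. */
#define LINE_STEP      32   /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 64   /* assumed "magic number" distance in bytes */

/* Byte copy that touches the destination ahead of the stores, so the
 * destination line is (hopefully) already in the data cache when the
 * stores reach it.  The loaded value is discarded, so nothing depends
 * on the load completing. */
void *copy_with_dest_touch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Every LINE_STEP bytes copied, read one destination byte that
         * lies PREFETCH_AHEAD bytes further on.  The bound check keeps
         * the read inside the destination buffer. */
        if ((i % LINE_STEP) == 0 && i + PREFETCH_AHEAD < n)
            (void)*(volatile unsigned char *)(d + i + PREFETCH_AHEAD);
        d[i] = s[i];
    }
    return dst;
}
```

A real version would copy words or use load/store multiple; the byte loop
only keeps the sketch short. Whether the touch helps at all depends on the
cache and write buffer behaviour of the part in question.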
I also wonder about writing the misaligned tail (esp. of memset)
before doing the bulk write. That gave an improvement for the i386 kernel
memset (although the misaligned memory access support makes it a lot
easier there).
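
A sketch of the tail-first ordering in C, assuming a target that allows
misaligned word stores (as i386 does). The function name is hypothetical
and this is not the kernel code; the fixed-size memcpy calls stand in for
single word stores:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n bytes with c, writing the last word first and the bulk after. */
void *fill_tail_first(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    /* Replicate the fill byte into all four bytes of a word. */
    uint32_t pattern = 0x01010101u * (unsigned char)c;
    size_t words, i;

    if (n < sizeof pattern) {
        /* Too short for a word store: fall back to bytes. */
        for (i = 0; i < n; i++)
            d[i] = (unsigned char)c;
        return dst;
    }

    /* Tail first: one possibly misaligned word covering the last 4 bytes. */
    memcpy(d + n - sizeof pattern, &pattern, sizeof pattern);

    /* Bulk: whole words from the start.  The final word may overlap the
     * tail word already written, which is harmless for a fill. */
    words = n / sizeof pattern;
    for (i = 0; i < words; i++)
        memcpy(d + i * sizeof pattern, &pattern, sizeof pattern);

    return dst;
}
```

On a CPU without misaligned stores the tail would have to go out as
individual bytes, and the bulk loop would need an aligned start, which is
the extra work alluded to above.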
David
--
David Laight: david@l8s.co.uk