Subject: Re: Xscale optimisations
To: David Laight <david@l8s.co.uk>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 10/14/2003 15:24:48
> > StrongARM load/store multiple instructions are expanded in the pipeline
> > into a sequence of equivalent load/store word operations (which is why
> > they take a long time to *not* execute if the condition fails). A
> > sequence of stores that miss the cache will go direct to the write buffer.
> > Provided that write-coalescing is enabled, this will be used to form a
> > burst on the memory bus.
>
> Mmmm IIRC we only ever saw bursts on the memory bus for cache line writes.
> (Although it wasn't me driving the analyser that day.)
Hmm, yes, I suspect I was mistaken on that. The SA110 timing apps note
does seem to confirm your observations.
>
> I know I got faster memcpy (on sa1100) by fetching the target buffer
> into the data cache (an lda offset by a magic number would do the trick,
> didn't stall since the target data was never used!)
Which is faster probably depends on the relative sequential (S) and
non-sequential (N) access times and on the number of words to be written to
a line, plus some compensation for the fact that other useful data will
likely be cast out of the cache. Counting a line fill plus the eventual
line write-back as two 8-word bursts, against eight non-sequential stores,
it is believable that 2(N+7S) < 8N (i.e. 2.33 S < N) for many memory
systems, and thus that fetching a line into the cache would most likely be
more efficient than writing to memory that is not in the cache.
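
To make the inequality concrete, here is a small C sketch of the comparison.
It assumes N and S are the non-sequential and sequential access times in the
same units, that a cache line is 8 words, and that the prefetch path is
charged one line fill plus one later line write-back. The function names are
made up for illustration and are not from any real library.

```c
#include <stdbool.h>

/* Hypothetical helper: cost of one 8-word burst, i.e. one
 * non-sequential access followed by seven sequential ones. */
unsigned burst_cost(unsigned n, unsigned s)
{
    return n + 7u * s;
}

/* Line fill plus later line write-back: 2(N + 7S). */
unsigned prefetch_cost(unsigned n, unsigned s)
{
    return 2u * burst_cost(n, s);
}

/* Eight individual non-sequential stores: 8N. */
unsigned single_store_cost(unsigned n)
{
    return 8u * n;
}

/* True when 2(N + 7S) < 8N, which reduces to 7S < 3N. */
bool prefetch_wins(unsigned n, unsigned s)
{
    return prefetch_cost(n, s) < single_store_cost(n);
}
```

With N = 3 and S = 1 the prefetch path costs 20 units against 24 for the
individual stores, so it wins. With N = 2 and S = 1 it costs 18 against 16
and loses.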
Actually, the DNARD PAL comments suggest it's more complicated than that:
AFAICT a cache line fill will take 14 clock ticks and a line write 12
clocks. 8 individual stores could take as many as 56 clocks, so there
would be a clear win from pre-fetching the line (potentially a factor of 4
performance improvement).
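
The arithmetic behind those figures, as a rough sketch. The 14, 12 and 56
clock counts are the ones quoted above; the function names are illustrative
only. The factor of 4 corresponds to comparing the 56 clocks against the line
fill alone. If the eventual line write is also charged to the copy, the ratio
comes out nearer 2.

```c
/* Clock counts as quoted from the DNARD PAL comments. */
enum {
    LINE_FILL_CLOCKS    = 14,
    LINE_WRITE_CLOCKS   = 12,
    EIGHT_STORES_CLOCKS = 56   /* worst case for 8 individual stores */
};

/* Ratio of two clock counts, in percent, using integer arithmetic
 * (the result is truncated, not rounded). */
unsigned ratio_percent(unsigned slow, unsigned fast)
{
    return slow * 100u / fast;
}

/* Worst-case individual stores versus the line fill alone. */
unsigned speedup_vs_fill_percent(void)
{
    return ratio_percent(EIGHT_STORES_CLOCKS, LINE_FILL_CLOCKS);
}

/* Worst-case individual stores versus line fill plus line write. */
unsigned speedup_vs_fill_and_write_percent(void)
{
    return ratio_percent(EIGHT_STORES_CLOCKS,
                         LINE_FILL_CLOCKS + LINE_WRITE_CLOCKS);
}
```

Here speedup_vs_fill_percent() gives 400 (a factor of 4) and
speedup_vs_fill_and_write_percent() gives 215 (a little over a factor of 2).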
R.