port-arm: Re: Xscale optimisations

Subject: Re: Xscale optimisations
To: David Laight <david@l8s.co.uk>
From: Richard Earnshaw <rearnsha@arm.com>
List: port-arm
Date: 10/14/2003 18:35:54

> > > Mmmm IIRC we only ever saw bursts of the memory bus for cache line writes.
> > > (Although it wasn't me driving the analyser that day.)
> > 
> > Hmm, yes, I suspect I was mistaken on that.   The SA110 timing apps note 
> > does seem to confirm your observations.
> 
> yes - we were expecting to see burst writes, but didn't....
> 
> > > I know I got faster memcpy (on sa1100) by fetching the target buffer
> > > into the data cache (an lda offset by a magic number would do the trick,
> > > didn't stall since the target data was never used!)
> > 
> > Which would be faster would probably depend on the relative 
> > sequential/non-sequential times and the number of words to be written to a 
> > line.  Plus some compensation for the fact that other useful data will 
> > likely be cast out of the cache.  It is believable that 2(N+7S) < 8N (ie 
> > 2.33 S < N) for many memory systems and thus that fetching a line into 
> > cache would most likely be more efficient than writing to memory that was 
> > out of the cache.
> 
> N = first, S = subsequent

Sorry, yes (N=Non-sequential, S=Sequential).  It's terminology from old 
ARM data sheets which talked about N, S and I (Internal) cycles.

A sequential cycle must follow either an N cycle or an S cycle and must be 
at an ascending address (in this case wrap-around on the same CAS address 
would be OK).

So a cache line fill (or drain) would look like

	N-S-S-S-S-S-S-S

and individual stores would be

	N-N-N...

R.
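
The comparison discussed in this message, 2(N+7S) < 8N, can be checked numerically with a small sketch. This is only an illustrative model, not anything taken from the SA110 documentation: the function names, the 8-word line size as a default parameter, and the sample cycle costs are all assumptions chosen for the example. N and S are the non-sequential and sequential access costs in whatever time unit the memory system uses.

```python
def access_pattern(words=8):
    """Cycle pattern for a burst of `words` words: one N, then S cycles."""
    return "-".join(["N"] + ["S"] * (words - 1))


def line_burst_cost(n, s, words=8):
    """Cost of one cache line fill or drain: N + (words - 1) * S."""
    return n + (words - 1) * s


def individual_store_cost(n, words=8):
    """Cost of writing each word as a separate non-sequential access."""
    return words * n


def fetch_then_write_is_cheaper(n, s, words=8):
    """True when a line fill plus a later line drain (two bursts) costs
    less than writing every word of the line individually.

    For words=8 this is the 2(N+7S) < 8N test, i.e. roughly 2.33 S < N.
    It ignores the cost of useful data being cast out of the cache.
    """
    return 2 * line_burst_cost(n, s, words) < individual_store_cost(n, words)
```

For example, with the hypothetical costs N=5 and S=2, two bursts cost 38 against 40 for individual stores, so fetching the line first wins; with N=4 and S=2 the bursts cost 36 against 32, so it does not.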