Subject: Re: inter-cpu coherency and kernel data structures
To: None <thorpej@zembu.com>
From: Hal Murray <murray@pa.dec.com>
List: tech-smp
Date: 08/16/2000 20:35:24
I'll take a try at this, at least the Alpha part.

I'm not a wizard in this area, but I hang out with people who are 
and I like low level stuff so this is an excuse for me to learn.

Standard disclaimers etc.  This is my best effort, nothing official.  
(But please let me know if I botch anything.) 


> On the Alpha, you're supposed to issue a memory barrier to ensure
> that other "processors" see the changes you've made to a memory
> location -- basically, you drain the write buffer.  The "processor"
> may be an I/O device (such as a disk controller, DMA mapper, etc.)

First red flag.  The MB doesn't necessarily drain the write buffer.  
 

On an Alpha, a CPU is allowed to reorder loads and stores. 

First consider a single CPU system.  The CPU can hang on to modified 
data for an arbitrary time and it can do arbitrary prefetches.  It 
can reorder stores relative to other stores.  It can reorder loads 
relative to other loads.  It can reorder loads/stores.  It must look 
to the programmer as if it is doing things in the order specified.

  EV6s have write-back caches so writes to the outside world can 
  easily get reordered and have huge delays. 

Now consider multiple CPUs.  The question is if CPU #1 executes a 
sequence of loads and stores, will CPU #2 see the same order?  The 
answer is that CPU #2 can see a different order unless you use the 
MB.

With the MB, all the loads/stores before the MB appear to other CPUs 
to happen before any of the loads/stores after the MB. 

So the typical sequence of:
  lock
  critical-section-with-loads+stores
  unlock

Should get turned into:
  lock
  MB
  critical-section-with-loads+stores
  MB
  unlock

Obviously, the MBs can get folded into the lock/unlock macros.

A good term to use when discussing this is "leak".  Think of the 
nominal sequence of loads and stores as a reference, remembering 
that the lock/unlock includes loads and stores.  The first MB above 
keeps any loads from "leaking" out of the critical section by getting 
prefetched before the loads/stores used to get the lock.  The second 
MB keeps stores in the critical section from leaking out past the 
store in the unlock, it prevents delayed writes. 

MBs are expensive.  You don't want to sprinkle them around just to 
be safe.  (Of course, you don't want to miss one.)


> While doing some hacking in the pmap module, it occurred to me that
> this would be a good idea for PTEs, too.  But then I thought -- geez,
> where does it end?  Do you issue an MB after frobbing with any random
> linked list structure?  Etc.

I would expect SMP ready code would have a lock/mutex to protect 
the linked list.  If so, the MB as described above (and folded into 
the lock/unlock macros) should do the right thing. 

VAXes, for example, had microcode that did atomic queue manipulation 
so they didn't need locks.  If you have code like that you probably 
need to add locks.

For very simple updates, you might be able to write some sneaky code 
that does one ldx_l/stx_c rather than two.  The idea is to do the 
work directly rather than using a lock.  (The restrictions on what 
you can do between the ldx_l and the stx_c are tight.)



> Well, after disassembling OSF/1's simple lock routines, they clearly
> are using MBs after stl_c, and Brown Book (newer version of Green Book :-)
> doesn't say anything about stl_c performing a barrier.  stl_c merely
> interacts with the "locked" register in the memory controller, as far
> as I can tell.

The stl_c doesn't do the barrier, just the conditional store. 

> In any case, my question was mostly about things which are not using
> ldl_l/stl_c, but rather other global data structures.  Like linked
> lists, PTEs, VM objects, etc. which are generally protected with
> locks, which would then have to ensure coherency of THOSE structures
> by issuing an MB.

I think most things will work correctly if the MB is included in 
the lock/unlock macros.  If I'm missing something, please let me 
know. 

------

PTEs are interesting.  If you modify a PTE entry, you have to kick 
the other CPUs so they can reload any TLBs that might be holding 
that info.  I think you can avoid that if you are changing a PTE 
from invalid to valid.

  Note that Alpha TLBs have an address-space slot that gets compared 
  like address bits, so even if you know that a process isn't running 
  on a CPU, there might be TLBs with leftover info.  (The idea is 
  to speed up context switches.) 

------

The other tricky area that I know about is writes to PCI devices.  
If you want to make sure that a write to a card has happened (say 
to turn off an interrupt), you have to read that location.  This 
is ugly because IO reads are painfully slow. 

There is no way to really flush things.  Even if you get them out 
of the Alpha, they can get hung up in PCI-PCI bridges and such. 

So the pattern is:
  write magic location
  MB
  read magic location

The MB makes sure the read doesn't get prefetched inside the Alpha. 
 
Once the writes are outside the Alpha, the PCI rules require that 
the writes happen before the read does.
 
If you can't read that location because of side-effects, you can 
do a MB, then a dummy write to another location, and then read that 
location instead. 


If you are writing control blocks on the device and then setting 
a flag to start activity, you probably need an MB to make sure the 
write-to-start doesn't get reordered ahead of the writes for the 
control info. 

------

>	1. alpha_mb();
>	2. alp->lock_data = 0;
>	3. alpha_mb();
...
>	3. Ensure that the lock is released before anything else happens.

The second MB isn't necessary, at least if you are using the typical 
lock, critical-section, unlock pattern.  No other CPU can get the 
lock until they see the write that unlocks the lock.  (Remember that 
the MB isn't flusing any buffers, just making sure that the order 
as seen by other CPUs is correct.)