Subject: Re: inter-cpu coherency and kernel data structures
To: None <thorpej@zembu.com>
From: Hal Murray <murray@pa.dec.com>
List: tech-smp
Date: 08/16/2000 20:35:24
I'll take a try at this, at least the Alpha part.
I'm not a wizard in this area, but I hang out with people who are
and I like low level stuff so this is an excuse for me to learn.
Standard disclaimers etc. This is my best effort, nothing official.
(But please let me know if I botch anything.)
> On the Alpha, you're supposed to issue a memory barrier to ensure
> that other "processors" see the changes you've made to a memory
> location -- basically, you drain the write buffer. The "processor"
> may be an I/O device (such as a disk controller, DMA mapper, etc.)
First red flag. The MB doesn't necessarily drain the write buffer.
On an Alpha, a CPU is allowed to reorder loads and stores.
First consider a single CPU system. The CPU can hang on to modified
data for an arbitrary time and it can do arbitrary prefetches. It
can reorder stores relative to other stores. It can reorder loads
relative to other loads, and it can reorder loads relative to
stores. It must still look to the programmer as if it is doing
things in the order specified.
EV6s have write-back caches so writes to the outside world can
easily get reordered and have huge delays.
Now consider multiple CPUs. The question is: if CPU #1 executes a
sequence of loads and stores, will CPU #2 see the same order? The
answer is that CPU #2 can see a different order unless you use the
MB.
With the MB, all the loads/stores before the MB appear to other CPUs
to happen before any of the loads/stores after the MB.
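For example, here is the classic publish/consume pattern as a sketch.
(alpha_mb() stands for whatever macro emits the MB instruction, the
variable names are made up, and assume flag and data are declared so
the compiler actually performs the accesses, e.g. volatile.)

	/* CPU #1: publish some data. */
	data = compute();	/* hypothetical payload */
	alpha_mb();		/* data store visible before the flag store */
	flag = 1;

	/* CPU #2: consume it. */
	while (flag == 0)
		continue;	/* spin until CPU #1 sets the flag */
	alpha_mb();		/* keep the data load from being prefetched */
	use(data);		/* now guaranteed to see CPU #1's data */

Without the second MB, CPU #2's load of data could have been
prefetched before it saw the flag.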
So the typical sequence of:

	lock
	critical-section-with-loads+stores
	unlock

should get turned into:

	lock
	MB
	critical-section-with-loads+stores
	MB
	unlock
Obviously, the MBs can get folded into the lock/unlock macros.
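Concretely, the folding might look like this minimal sketch, where
test_and_set() is a hypothetical primitive built on ldl_l/stl_c; none
of these names are NetBSD's:

	static inline void
	spin_lock(volatile long *l)
	{
		while (test_and_set(l) != 0)	/* hypothetical ldl_l/stl_c loop */
			continue;
		alpha_mb();	/* nothing in the section leaks above the lock */
	}

	static inline void
	spin_unlock(volatile long *l)
	{
		alpha_mb();	/* the section's stores are visible first */
		*l = 0;		/* an ordinary store releases the lock */
	}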
A good term to use when discussing this is "leak". Think of the
nominal sequence of loads and stores as a reference, remembering
that the lock/unlock includes loads and stores. The first MB above
keeps any loads from "leaking" out of the critical section by getting
prefetched before the loads/stores used to get the lock. The second
MB keeps stores in the critical section from leaking out past the
store in the unlock; that is, it prevents delayed writes.
MBs are expensive. You don't want to sprinkle them around just to
be safe. (Of course, you don't want to miss one.)
> While doing some hacking in the pmap module, it occurred to me that
> this would be a good idea for PTEs, too. But then I thought -- geez,
> where does it end? Do you issue an MB after frobbing with any random
> linked list structure? Etc.
I would expect SMP-ready code to have a lock/mutex protecting
the linked list. If so, the MBs as described above (and folded into
the lock/unlock macros) should do the right thing.
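For instance, a locked list insert is just (a sketch reusing the
hypothetical spin_lock/spin_unlock above, with the MBs already inside
them):

	spin_lock(&list_lock);		/* MB inside the macro */
	new->next = head;		/* plain loads/stores in the section */
	head = new;
	spin_unlock(&list_lock);	/* MB inside the macro */

No extra MBs are needed around the list manipulation itself.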
VAXes, for example, had microcode that did atomic queue manipulation
so they didn't need locks. If you have code like that, you probably
need to add locks.
For very simple updates, you might be able to write some sneaky code
that does one ldx_l/stx_c rather than two. The idea is to do the
work directly rather than using a lock. (The restrictions on what
you can do between the ldx_l and the stx_c are tight.)
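For example, an atomic increment fits in one ldq_l/stq_c sequence.
This is a sketch assuming GCC-style inline assembly; the names are
mine. Note there's no MB here: a bare counter doesn't publish any
other data, so no ordering is needed.

	static inline void
	atomic_incq(volatile long *p)
	{
		long t;

		__asm __volatile(
		"1:	ldq_l	%0,%1\n"	/* load-locked */
		"	addq	%0,1,%0\n"	/* the whole update, inline */
		"	stq_c	%0,%1\n"	/* store-conditional */
		"	beq	%0,1b\n"	/* lost the reservation; retry */
		: "=&r" (t), "+m" (*p));
	}

(Production code usually branches forward on stq_c failure, to be
kinder to the branch predictor, but the backward beq is the simplest
correct form.)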
> Well, after disassembling OSF/1's simple lock routines, they clearly
> are using MBs after stl_c, and Brown Book (newer version of Green Book :-)
> doesn't say anything about stl_c performing a barrier. stl_c merely
> interacts with the "locked" register in the memory controller, as far
> as I can tell.
The stl_c doesn't do the barrier, just the conditional store.
> In any case, my question was mostly about things which are not using
> ldl_l/stl_c, but rather other global data structures. Like linked
> lists, PTEs, VM objects, etc. which are generally protected with
> locks, which would then have to ensure coherency of THOSE structures
> by issuing an MB.
I think most things will work correctly if the MB is included in
the lock/unlock macros. If I'm missing something, please let me
know.
------
PTEs are interesting. If you modify a PTE, you have to kick
the other CPUs so they can invalidate any TLB entries that might be
holding the old info. I think you can avoid that if you are changing
a PTE from invalid to valid.
Note that Alpha TLB entries include an address-space number that gets
compared like address bits, so even if you know that a process isn't
running on a CPU, there might be TLB entries with leftover info.
(The idea is to speed up context switches.)
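The usual shootdown pattern is a sketch like this; the IPI names are
hypothetical, not NetBSD's:

	*pte = new_pte;			/* modify the PTE */
	alpha_mb();			/* PTE store visible before the kick */
	ipi_broadcast(IPI_SHOOTDOWN);	/* hypothetical: each CPU then does
					   a tbis/tbia to drop stale entries */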
------
The other tricky area that I know about is writes to PCI devices.
If you want to make sure that a write to a card has happened (say,
to turn off an interrupt), you have to read that location. This
is ugly because I/O reads are painfully slow.
There is no way to really flush things. Even if you get them out
of the Alpha, they can get hung up in PCI-PCI bridges and such.
So the pattern is:

	write magic location
	MB
	read magic location
The MB makes sure the read doesn't get prefetched inside the Alpha.
Once the writes are outside the Alpha, the PCI rules require that
the writes happen before the read does.
If you can't read that location because of side-effects, you can
do an MB, then a dummy write to another location, and then read that
other location instead.
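In C the whole dance is a sketch like this, assuming a volatile
pointer to the card's memory-mapped registers (the struct and field
names are made up):

	static void
	quiesce_interrupts(volatile struct dev_regs *regs)
	{
		regs->intr_disable = 1;	  /* the write that must land */
		alpha_mb();		  /* read can't be prefetched past it */
		(void)regs->intr_disable; /* PCI pushes the write ahead of
					     this read */
	}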
If you are writing control blocks on the device and then setting
a flag to start activity, you probably need an MB to make sure the
write-to-start doesn't get reordered ahead of the writes for the
control info.
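Sketch (same caveats; desc and regs are made-up names):

	desc->addr = dma_addr;	/* fill in the control block */
	desc->len = len;
	alpha_mb();		/* control-block stores ordered before... */
	regs->start = 1;	/* ...the write-to-start */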
------
> 1. alpha_mb();
> 2. alp->lock_data = 0;
> 3. alpha_mb();
...
> 3. Ensure that the lock is released before anything else happens.
The second MB isn't necessary, at least if you are using the typical
lock, critical-section, unlock pattern. No other CPU can get the
lock until it sees the write that unlocks the lock. (Remember that
the MB isn't flushing any buffers, just making sure that the order
as seen by other CPUs is correct.)