Subject: Re: spl models and smp (was Re: Some interesting papers on BSD ...)
To: Gordon W. Ross <gwr@mc.com>
From: Terry Lambert <terry@lambert.org>
List: tech-kern
Date: 09/19/1996 12:24:31
[ ... SVR4/MP mutex/spl interaction ... ]
> Note that you MUST hold a mutex lock on some object that has both
> the mutex and a condition variable, and the cv_timedwait_sig()
> does an atomic "block and release the mutex" while making you
> non-runnable, and later does an atomic "resume and take mutex."
> Interesting scheme, eh?
The one problem with this is that condition variables, as required,
must be synchronized across all processors in a shared area mapped
into the kernel address space of all of them.
This seems a big hit on concurrency, to me.
I prefer a design which would include the ability to transparently
localize a mutex/condition variable based on the resource locality,
to a single CPU.
This implies that the mutex allocation is hierarchical, in the same
way that all processes are descendants of "init" (a "group leader"
is equivalent to a CPU locality for this analogy).
This allows the use of mutexes and condition variables which will
only invoke bus arbitration *as necessary*. The key is the ability
to predict deadlock conditions. You can do this by computing the
transitive closure over the hierarchy as if it were a directed
acyclic graph. There is code in both "tsort" and "gprof" that gives
an example of how this works in actual practice.
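
A minimal sketch of the closure computation, for illustration only.
It uses Warshall's algorithm on a small adjacency matrix rather than
the code from "tsort" or "gprof", and every name in it (lock_graph,
MAX_LOCKS, and so on) is hypothetical. An edge means "this lock may
be taken while that one is held"; after closure, a lock that can
reach itself sits on a cycle, so the recorded orderings can deadlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8   /* hypothetical limit, chosen for the example */

/* reach[i][j] is true when lock j may be acquired while lock i is held. */
typedef struct {
    bool reach[MAX_LOCKS][MAX_LOCKS];
} lock_graph;

static void lock_graph_init(lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Record the direct ordering "held is taken before wanted". */
static void lock_graph_add_order(lock_graph *g, int held, int wanted)
{
    assert(held >= 0 && held < MAX_LOCKS);
    assert(wanted >= 0 && wanted < MAX_LOCKS);
    g->reach[held][wanted] = true;
}

/* Warshall's algorithm: extend the direct edges to everything reachable. */
static void lock_graph_close(lock_graph *g)
{
    for (int k = 0; k < MAX_LOCKS; k++)
        for (int i = 0; i < MAX_LOCKS; i++)
            if (g->reach[i][k])
                for (int j = 0; j < MAX_LOCKS; j++)
                    if (g->reach[k][j])
                        g->reach[i][j] = true;
}

/* After closure, a lock that reaches itself lies on a cycle. */
static bool lock_graph_can_deadlock(const lock_graph *g)
{
    for (int i = 0; i < MAX_LOCKS; i++)
        if (g->reach[i][i])
            return true;
    return false;
}
```

With orderings 0 -> 1 and 1 -> 2 recorded, the closure adds 0 -> 2 and
reports no deadlock; adding 2 -> 0 and closing again reports one.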
Because it is a hierarchy, this means you can "inherit to root" hints
to allow the computation to be more quickly achieved (or-and-xor-and
in the simplest case). The main enabling technology for making
hints so inexpensive is the use of lock intention modes, so that once
intention is established, a resync of the shared objects is not required
to prevent conflicting requests.
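
To make the intention-mode idea concrete, here is a hedged sketch
using the conventional five-mode compatibility matrix (IS, IX, S,
SIX, X) from hierarchical locking. It is an assumption that these
are the modes meant above, and the types and function names are
hypothetical. The bookkeeping is single-threaded for clarity; it
shows only the grant logic, not how the counters themselves would be
protected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LK_IS, LK_IX, LK_S, LK_SIX, LK_X, LK_NMODES } lock_mode;

/* lock_compat[held][wanted]: may "wanted" be granted while "held" is held? */
static const bool lock_compat[LK_NMODES][LK_NMODES] = {
    /*           IS     IX     S      SIX    X     */
    /* IS  */ { true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  false, false, false },
    /* S   */ { true,  false, true,  false, false },
    /* SIX */ { true,  false, false, false, false },
    /* X   */ { false, false, false, false, false },
};

/* One node in the lock hierarchy; held[m] counts holders in mode m. */
typedef struct lock_node {
    struct lock_node *parent;
    unsigned held[LK_NMODES];
} lock_node;

static bool lock_node_can_grant(const lock_node *n, lock_mode want)
{
    assert((unsigned)want < LK_NMODES);
    for (int m = 0; m < LK_NMODES; m++)
        if (n->held[m] > 0 && !lock_compat[m][want])
            return false;
    return true;
}

/* Readers announce IS on their ancestors, everything else announces IX. */
static lock_mode lock_intent_for(lock_mode want)
{
    return (want == LK_S || want == LK_IS) ? LK_IS : LK_IX;
}

/* Grant "want" on n only if every ancestor accepts the matching intent. */
static bool lock_try_acquire(lock_node *n, lock_mode want)
{
    lock_mode intent = lock_intent_for(want);

    for (lock_node *a = n->parent; a != NULL; a = a->parent)
        if (!lock_node_can_grant(a, intent))
            return false;
    if (!lock_node_can_grant(n, want))
        return false;

    for (lock_node *a = n->parent; a != NULL; a = a->parent)
        a->held[intent]++;
    n->held[want]++;
    return true;
}
```

Once two per-CPU children each hold X (leaving IX recorded twice on
the root), a request for S on the root is refused by looking at the
root alone; the children never need to be consulted.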
Clearly, we want to compromise and not propagate inheritance of hints
over the boundary to the common system instead of per-CPU objects. This
trades propagation bus overhead for run time overhead. In effect, the
"expensive" object references become slightly more expensive because
the propagation is not interleaved. In trade, we get vastly
increased access concurrency without bus arbitration for "inexpensive"
(local to a single CPU context) objects.
It is then incumbent upon us to design using as few critical path
"expensive" objects as possible. For instance, we should use the
Dynix (Sequent) per CPU page pool design for local VM allocations,
and only go to a shared (bus arbitrating) mutex to refill the local
page pool for a CPU from the global page pool, when we hit a low
water mark (or conversely, return pages to the global pool only when
we hit a high water mark). Unlike file system reentrancy, Sequent
did this right: it is intuitively scalable to N processors (it turns
out that Unisys has/had the best -- IMO -- FS reentrancy mechanisms).
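
A rough sketch of the per-CPU pool idea, not Dynix code. It assumes
the usual convention that a local pool is refilled when it runs low
and drained when it grows too large; the thresholds and all names
are hypothetical. The shared mutex is modelled by a counter so the
number of global (bus arbitrating) operations is visible.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_LOW     4   /* refill when the local pool drops below this */
#define POOL_HIGH   16   /* drain when the local pool rises above this  */
#define POOL_TARGET ((POOL_LOW + POOL_HIGH) / 2)

typedef struct {
    size_t free_pages;
    unsigned long lock_acquisitions;  /* stands in for the shared mutex */
} global_pool;

typedef struct {
    size_t free_pages;                /* touched only by the owning CPU */
    global_pool *global;
} cpu_pool;

/* Slow path: the only place the global pool (and its lock) is touched. */
static void cpu_pool_rebalance(cpu_pool *cp)
{
    global_pool *gp = cp->global;

    gp->lock_acquisitions++;          /* real code: take the shared mutex */
    if (cp->free_pages < POOL_TARGET) {
        size_t want = POOL_TARGET - cp->free_pages;
        size_t got = want < gp->free_pages ? want : gp->free_pages;
        gp->free_pages -= got;
        cp->free_pages += got;
    } else {
        size_t extra = cp->free_pages - POOL_TARGET;
        cp->free_pages -= extra;
        gp->free_pages += extra;
    }
    /* real code: release the shared mutex */
}

/* Returns 1 if a page was handed out, 0 if none is available anywhere. */
static int cpu_pool_alloc(cpu_pool *cp)
{
    if (cp->free_pages < POOL_LOW)
        cpu_pool_rebalance(cp);
    if (cp->free_pages == 0)
        return 0;
    cp->free_pages--;
    return 1;
}

static void cpu_pool_free(cpu_pool *cp)
{
    cp->free_pages++;
    if (cp->free_pages > POOL_HIGH)
        cpu_pool_rebalance(cp);
}
```

Between the two water marks, allocations and frees touch only the
local counter, so the shared lock is taken once per batch of pages
rather than once per page.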
Other issues, such as per FS reentrancy for FS's in a transition
kernel, can be handled by allocating an expensive global mutex for
the VFS subsystems (and one for each other kernel subsystem, to
achieve a per-subsystem granularity). At that point, it's possible
to "push down" the interfaces and kernel reentrancy through the trap,
exception, and interrupt code, to gradually increase concurrency.
It also allows import of "foreign" file systems, drivers, and other
components by causing them to use the global mutex until such time
as they can be made "safely kernel reentrant" and "safely kernel
context reentrant", for kernel multithreading and SMP reentrancy,
respectively.
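
One way the per-subsystem global mutex could look, as a sketch under
stated assumptions: the subsystem list, the "mp_safe" flag, and the
wrapper names are all hypothetical, and a C11 atomic spin flag
stands in for whatever mutex primitive the kernel would really use.
Components not yet audited are funnelled through their subsystem's
lock; audited ones bypass it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { SUBSYS_VFS, SUBSYS_NET, SUBSYS_VM, SUBSYS_COUNT } subsys_id;

/* One "expensive" global lock per subsystem (static storage: all false). */
static atomic_bool subsys_busy[SUBSYS_COUNT];

typedef struct {
    const char *name;
    subsys_id   subsys;
    bool        mp_safe;            /* set once audited as reentrant */
    int       (*entry)(void *arg);
} component;

static void subsys_enter(subsys_id s)
{
    /* Spin until the previous value was false, i.e. we took the lock. */
    while (atomic_exchange_explicit(&subsys_busy[s], true,
                                    memory_order_acquire))
        ;
}

static void subsys_exit(subsys_id s)
{
    atomic_store_explicit(&subsys_busy[s], false, memory_order_release);
}

static bool subsys_is_held(subsys_id s)
{
    return atomic_load_explicit(&subsys_busy[s], memory_order_relaxed);
}

/* Call into a component, taking the subsystem lock only if it needs it. */
static int component_call(const component *c, void *arg)
{
    if (c->mp_safe)
        return c->entry(arg);

    subsys_enter(c->subsys);
    int rc = c->entry(arg);
    subsys_exit(c->subsys);
    return rc;
}

/* Hypothetical entry point for the demo: reports whether the lock is held. */
static int demo_entry(void *arg)
{
    subsys_id s = *(subsys_id *)arg;
    return subsys_is_held(s) ? 1 : 0;
}
```

Converting a component then amounts to auditing it and flipping its
flag, which is what lets concurrency be increased gradually.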
I *highly* recommend "UNIX Systems for Modern Architectures"; it is basically
a handbook on how to build SMP UNIX systems.
SMP should be the target, since the context handling necessary for
SMP buys you the ability to reenter on a single CPU context (for
kernel multithreading) for free. It also buys you the ability to
support kernel preemption (because of the multithreading contexts),
which is something that is necessary to support true RealTime
scheduling algorithms (like deadlining) and related issues (like
priority lending or RT event-based preemption).
So this discussion probably belongs on the SMP list...
Regards,
Terry Lambert
terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.