Subject: Re: SMP API things, lock debugging, etc.
To: Jason Thorpe <thorpej@nas.nasa.gov>
From: Stefan Grefen <Stefan.Grefen@tantau.com>
List: tech-smp
Date: 07/28/1999 10:38:20
Jason Thorpe wrote:
> 
> On Tue, 27 Jul 1999 23:27:08 +0200
>  Stefan Grefen <Stefan.Grefen@tantau.com> wrote:
> 
>  > OK lets define the (S)MP semantics of spl*().
> 
> Okay!  :-)
> 
>  > Do we block the interrupts on the affected processor or on all processors?
> 
> It blocks the specified interrupts on the "current processor", not all.
> 
>  > Only if it blocks interrupts on all processors this method prevents
>  > deadlocks.
> 
> How so?
> 
>         CPU 0: s = splfoo(), simple_lock(foo_lock)
>         CPU 1: foointr (implicit splfoo), simple_lock(foo_lock) ... spins
>         CPU 0: finishes with foo, simple_unlock(foo_lock), splx(s)
>         CPU 1: acquires foo_lock, does its thing, simple_unlock(foo_lock)
> 
> Can you describe to me a situation where failing to block interrupts on
> all processors will cause deadlock (cases where one wasn't careful in
> designing the locking protocol for the given subsystem don't count :-)
> 

The not-careful case is exactly what I have in mind. The current lock protocol
was designed for an environment where certain events can't happen.
Basically, the kernel is not preemptive and there is no concurrent activity
while an interrupt function is running.
In this environment lock-order problems just don't exist as long as the locks are
at the same spl level. And I'm pretty sure there are some violations of a
strict lock order in the code. The problem is not the locking protocol per
subsystem, but the interaction between the subsystems. That is hard to get right.

Assume splbar < splfoo:

       CPU 0: s=splhigh(), simple_lock(foo_lock); simple_lock(bar_lock)
       CPU 1: s=splbar(), simple_lock(bar_lock); foointr (implicit splfoo), simple_lock(foo_lock) ... spins

This is completely legal under the old BSD semantics (replace simple_lock(xx_lock)
with access to resource xx) but will lead to a deadlock: CPU 0 spins on bar_lock
while holding foo_lock, and CPU 1 spins on foo_lock while holding bar_lock.
The simple_lock(bar_lock) can happen just by calling a function in a different
subsystem. Unless we want to enforce lock order across subsystems (which probably
boils down to "don't hold a lock while calling a function which may lock") and
document it in detail, I doubt we will get a stable system that way.
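
To make the scenario concrete, here is a minimal user-space sketch (not kernel
code) that models the two CPUs and the two locks and checks the wait-for chain
for a cycle. All names in it (model_lock, model_deadlocked, the scenario
functions) are hypothetical and exist only for illustration; it plays back one
interleaving that produces the deadlock, plus the single-lock case from the
quoted example for contrast.

```c
#include <assert.h>

#define NCPU   2
#define NLOCK  2
#define NOBODY (-1)

enum { FOO_LOCK, BAR_LOCK };

static int owner[NLOCK];   /* CPU holding each lock, or NOBODY */
static int waiting[NCPU];  /* lock each CPU is spinning on, or NOBODY */

static void model_reset(void)
{
    for (int l = 0; l < NLOCK; l++)
        owner[l] = NOBODY;
    for (int c = 0; c < NCPU; c++)
        waiting[c] = NOBODY;
}

/* Model simple_lock(): take the lock if it is free, else record the spin. */
static int model_lock(int cpu, int lock)
{
    assert(cpu >= 0 && cpu < NCPU && lock >= 0 && lock < NLOCK);
    if (owner[lock] == NOBODY) {
        owner[lock] = cpu;
        return 1;
    }
    waiting[cpu] = lock;
    return 0;
}

/* Follow the wait-for chain from cpu; arriving back at cpu is a deadlock. */
static int model_deadlocked(int cpu)
{
    int cur = cpu;

    for (int steps = 0; steps < NCPU; steps++) {
        if (waiting[cur] == NOBODY)
            return 0;               /* cur can still make progress */
        cur = owner[waiting[cur]];
        if (cur == NOBODY)
            return 0;
        if (cur == cpu)
            return 1;               /* cycle: nobody will ever unlock */
    }
    return 0;
}

/* Two locks, per-CPU spl, splbar < splfoo: one deadlocking interleaving. */
static int scenario_per_cpu_spl(void)
{
    model_reset();
    model_lock(0, FOO_LOCK);  /* CPU 0: splhigh(), simple_lock(foo_lock) */
    model_lock(1, BAR_LOCK);  /* CPU 1: splbar(), simple_lock(bar_lock) */
    model_lock(0, BAR_LOCK);  /* CPU 0: simple_lock(bar_lock) ... spins */
    model_lock(1, FOO_LOCK);  /* CPU 1: foointr, simple_lock(foo_lock) ... spins */
    return model_deadlocked(0);
}

/* Single lock: CPU 1 spins, but CPU 0 is not waiting, so no deadlock. */
static int scenario_single_lock(void)
{
    model_reset();
    model_lock(0, FOO_LOCK);  /* CPU 0: splfoo(), simple_lock(foo_lock) */
    model_lock(1, FOO_LOCK);  /* CPU 1: foointr, simple_lock(foo_lock) ... spins */
    return model_deadlocked(1);
}
```

In this model scenario_per_cpu_spl() reports a cycle and scenario_single_lock()
does not. With a global splxxx(), foointr could not run on CPU 1 while CPU 0
sits at splhigh(), so the last step of the first scenario would not happen.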
My strategy for NetBSD SMP would be simple locks with global splxxx() to get
an SMP system up and running, and then think about doing locks the right way.

What the right way is depends on the system you design for (shared-memory SMP,
NUMA, ... non-shared-memory cluster).

BTW, since locks on some systems (HP) need to be cache-line sized, we may want
to group those locks (or put them in a hash table).
This leads to a missing function:

void cpu_simple_lock_destroy(__volatile struct simplelock *alp);

         With LOCKDEBUG or DIAGNOSTIC, complain if the lock is still locked
         and initialize it to an illegal value for debugging.
         A no-op on most systems without LOCKDEBUG or DIAGNOSTIC.


With hashed locks this is needed to remove the lock from the hash list.
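
A rough user-space sketch of what such a destroy routine could look like with
hashed locks follows. Everything beyond the idea itself is an assumption made
for illustration: the struct layout, the hash_next link, the SL_DEAD poison
value, the bucket count and the sketch_* names are hypothetical, and a counter
stands in for the LOCKDEBUG/DIAGNOSTIC complaint. It is not the NetBSD
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SL_UNLOCKED 0
#define SL_LOCKED   1
#define SL_DEAD     0x7fffdead  /* hypothetical "illegal value" poison */
#define SL_NBUCKET  16          /* hypothetical hash table size */

struct simplelock {
    volatile int       lock_data;
    struct simplelock *hash_next;  /* hypothetical hash chain link */
};

static struct simplelock *sl_bucket[SL_NBUCKET];
static int sl_complaints;  /* stands in for a LOCKDEBUG/DIAGNOSTIC message */

/* Hash on the lock's address, ignoring the low alignment bits. */
static size_t sl_hash(const struct simplelock *alp)
{
    return (size_t)(((uintptr_t)alp >> 4) % SL_NBUCKET);
}

/* Initialize the lock and enter it at the head of its hash chain. */
static void sketch_lock_init(struct simplelock *alp)
{
    size_t h;

    assert(alp != NULL);
    h = sl_hash(alp);
    alp->lock_data = SL_UNLOCKED;
    alp->hash_next = sl_bucket[h];
    sl_bucket[h] = alp;
}

/* Complain if still locked, unlink from the hash chain, then poison. */
static void sketch_lock_destroy(struct simplelock *alp)
{
    struct simplelock **pp;

    assert(alp != NULL);
    if (alp->lock_data == SL_LOCKED)
        sl_complaints++;

    for (pp = &sl_bucket[sl_hash(alp)]; *pp != NULL; pp = &(*pp)->hash_next) {
        if (*pp == alp) {
            *pp = alp->hash_next;
            break;
        }
    }
    alp->hash_next = NULL;
    alp->lock_data = SL_DEAD;
}

/* Return 1 if the lock is currently on its hash chain, else 0. */
static int sketch_lock_is_hashed(const struct simplelock *alp)
{
    const struct simplelock *p;

    for (p = sl_bucket[sl_hash(alp)]; p != NULL; p = p->hash_next)
        if (p == alp)
            return 1;
    return 0;
}
```

Without the destroy step, a freed object's lock would stay on its chain and the
next lookup walking that bucket would touch stale memory, which is the reason
the function is needed once locks are hashed.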

Stefan

>         -- Jason R. Thorpe <thorpej@nas.nasa.gov>