tech-userlevel: libpthread killed my dog, part N+1

Subject: libpthread killed my dog, part N+1
To: None <tech-userlevel@netbsd.org>
From: Charles M. Hannum <abuse@spamalicious.com>
List: tech-userlevel
Date: 01/06/2005 01:35:23

I have discovered another deadlock, *and* the reason for upcall exhaustion.

Let us review.  When we receive a SA_UPCALL_UNBLOCKED for a thread holding a
spinlock, we cause an immediate switch to that thread from
pthread__resolve_locks(), presumably on the theory that it will finish and
unlock immediately.  Note that at this point, pt_blockgen==pt_unblockgen+1;
pt_unblockgen gets incremented again after pthread__resolve_locks() returns
and we call pthread__sched_bulk().

However, it may happen that the thread blocks again.  When this happens, we 
now have a chain of upcall thread(s) implicitly blocked waiting for it.  In 
addition, pt_blockgen==pt_unblockgen+3.

Eventually we will get another SA_UPCALL_UNBLOCKED.  When this happens, if we 
are lucky, the thread will finish with the lock, and the hack in 
pthread_spinunlock() will switch back to the upcall thread immediately.  At 
this point, pt_blockgen==pt_unblockgen+2 (because we received two unblocks).

The upcall chain will then terminate, pthread__sched_bulk() will be
called, and because pt_unblockgen is already even, it will not be 
incremented!  Note that we are screwed now; various pieces of code will 
evermore think that the thread is blocked.  This leads to one form of 
deadlock (signal delivery will never succeed, and the thread can get stuck 
repeatedly taking a trap).
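
To make the counter arithmetic concrete, here is a toy model of the sequence
described so far.  It is only a sketch and is not the libpthread code: the
toy_* names are made up for illustration, and the step sizes (a block
advances the block generation by 2, each SA_UPCALL_UNBLOCKED advances the
unblock generation by 1, and pthread__sched_bulk() bumps it once more only
when it is odd) are assumptions inferred from the numbers quoted in this
message.  I also assume "blocked" is judged by the two counters differing.

```c
#include <assert.h>

/* Toy stand-ins for pt_blockgen and pt_unblockgen (hypothetical struct). */
struct toy_thread {
	unsigned blockgen;
	unsigned unblockgen;
};

/* Assumption: each time the thread blocks, blockgen advances by 2. */
static void toy_block(struct toy_thread *t)
{
	t->blockgen += 2;
}

/* Assumption: each SA_UPCALL_UNBLOCKED advances unblockgen by 1. */
static void toy_unblock_upcall(struct toy_thread *t)
{
	t->unblockgen += 1;
}

/* Assumption: the even-odd test only increments an odd unblockgen. */
static void toy_sched_bulk(struct toy_thread *t)
{
	if (t->unblockgen & 1)
		t->unblockgen++;
}

/* Assumption: other code treats "generations differ" as "still blocked". */
static int toy_looks_blocked(const struct toy_thread *t)
{
	return t->blockgen != t->unblockgen;
}

/* Ordinary case: block once, unblock once, then the bulk reschedule. */
static struct toy_thread toy_normal(void)
{
	struct toy_thread t = { 0, 0 };

	toy_block(&t);
	toy_unblock_upcall(&t);
	assert(t.blockgen == t.unblockgen + 1);
	toy_sched_bulk(&t);		/* odd, so it is incremented */
	return t;
}

/* Problem case: the spinlock holder blocks again before unlocking. */
static struct toy_thread toy_reblock(void)
{
	struct toy_thread t = { 0, 0 };

	toy_block(&t);
	toy_unblock_upcall(&t);
	assert(t.blockgen == t.unblockgen + 1);
	toy_block(&t);			/* blocks again inside the lock */
	assert(t.blockgen == t.unblockgen + 3);
	toy_unblock_upcall(&t);		/* second unblock arrives */
	assert(t.blockgen == t.unblockgen + 2);
	toy_sched_bulk(&t);		/* already even, so nothing happens */
	return t;
}
```

In this model toy_normal() ends with the two counters equal, while
toy_reblock() ends with them permanently two apart, so toy_looks_blocked()
stays true for a thread that is actually runnable.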

Even if I fix the even-odd test in pthread__sched_bulk(), this problem can 
still lead to upcall exhaustion, by causing a chain of upcalls to be stuck.  
I think -- but I'm not sure yet -- that they actually spin on the CPU, 
waiting for the unblock that will allow them to continue.

Somehow, in all this mess, pthread__concurrency also becomes -1.  I'm not sure 
exactly how that happens.


This really needs to be fixed, somehow.