Subject: libpthread killed my dog, part N+1
To: None <tech-userlevel@netbsd.org>
From: Charles M. Hannum <abuse@spamalicious.com>
List: tech-userlevel
Date: 01/06/2005 01:35:23
I have discovered another deadlock, *and* the reason for upcall exhaustion.
Let us review. When we receive an SA_UPCALL_UNBLOCKED for a thread holding a
spinlock, we cause an immediate switch to that thread from
pthread__resolve_locks(), presumably on the theory that it will finish and
unlock immediately. Note that at this point, pt_blockgen==pt_unblockgen+1;
pt_unblockgen gets incremented again after pthread__resolve_locks() returns
and we call pthread__sched_bulk().
However, it may happen that the thread blocks again. When this happens, we
now have a chain of upcall thread(s) implicitly blocked waiting for it. In
addition, pt_blockgen==pt_unblockgen+3.
Eventually we will get another SA_UPCALL_UNBLOCKED. When this happens, if we
are lucky, the thread will finish with the lock, and the hack in
pthread_spinunlock() will switch back to the upcall thread immediately. At
this point, pt_blockgen==pt_unblockgen+2 (because we received two unblocks).
The upcall chain will then terminate, pthread__sched_bulk() will be
called, and because pt_unblockgen is already even, it will not be
incremented! Note that we are screwed now; various pieces of code will
evermore think that the thread is blocked. This leads to one form of
deadlock (signal delivery will never succeed, and the thread can get stuck
repeatedly taking a trap).
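To make the counter arithmetic above concrete, here is a toy model. It is only a sketch: the update rules (a block adds 2 to pt_blockgen, an unblock upcall adds 1 to pt_unblockgen, and the even-odd test only bumps an odd pt_unblockgen) are inferred from the counts quoted above, not copied from libpthread, and every toy_* name is hypothetical.

```c
#include <assert.h>

/* Toy model: field names echo the post; the update rules are assumptions. */
struct toy_thread {
    int pt_blockgen;
    int pt_unblockgen;
};

/* Assumed: each time the thread blocks, pt_blockgen advances by two. */
void toy_block(struct toy_thread *t)
{
    t->pt_blockgen += 2;
}

/* Assumed: receiving SA_UPCALL_UNBLOCKED advances pt_unblockgen by one. */
void toy_unblock_upcall(struct toy_thread *t)
{
    t->pt_unblockgen += 1;
}

/* Assumed even-odd test: only an odd pt_unblockgen is incremented. */
void toy_sched_bulk(struct toy_thread *t)
{
    if (t->pt_unblockgen & 1)
        t->pt_unblockgen++;
}

/* Block once, unblock once, then run the bulk scheduler. */
int toy_normal_case(void)
{
    struct toy_thread t = { 0, 0 };

    toy_block(&t);           /* blockgen == unblockgen + 2 */
    toy_unblock_upcall(&t);  /* blockgen == unblockgen + 1 */
    toy_sched_bulk(&t);      /* odd, so incremented: counters match */
    return t.pt_blockgen - t.pt_unblockgen;
}

/* The thread blocks again before the bulk scheduler ever runs. */
int toy_reblock_case(void)
{
    struct toy_thread t = { 0, 0 };

    toy_block(&t);
    toy_unblock_upcall(&t);  /* blockgen == unblockgen + 1 */
    toy_block(&t);           /* blocks again while holding the lock */
    assert(t.pt_blockgen == t.pt_unblockgen + 3);
    toy_unblock_upcall(&t);  /* second unblock: blockgen == unblockgen + 2 */
    toy_sched_bulk(&t);      /* already even, so nothing is incremented */
    return t.pt_blockgen - t.pt_unblockgen;
}
```

Under these assumptions toy_normal_case() leaves the two counters equal, while toy_reblock_case() leaves them two apart with nothing left that would ever close the gap, which is the "evermore blocked" state described above.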
Even if I fix the even-odd test in pthread__sched_bulk(), this problem can
still lead to upcall exhaustion, by causing a chain of upcalls to be stuck.
I think -- but I'm not sure yet -- that they actually spin on the CPU,
waiting for the unblock that will allow them to continue.
Somehow, in all this mess, pthread__concurrency also becomes -1. I'm not sure
exactly how that happens.
This really needs to be fixed, somehow.