NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
kern/45631: ABA problem in subr_pcq.c?
>Number: 45631
>Category: kern
>Synopsis: ABA problem in subr_pcq.c?
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Nov 19 00:15:00 +0000 2011
>Originator: Dennis Ferguson
>Release: 5.99.52
>Organization:
>Environment:
NetBSD cow.pa.akit-ferguson.com 5.99.52 NetBSD 5.99.52 (GENERIC) #0: Wed Jun 8
01:49:43 PDT 2011
dennis%cow.pa.akit-ferguson.com@localhost:/usr/obj/sys/arch/amd64/compile/GENERIC
amd64
>Description:
Either the lockless queuing algorithm in subr_pcq.c has an ABA problem, or I'm
misunderstanding something fundamental and need to be told to get lost.
Suppose pcq_put() is called when the queue is empty. In this case we will have
pcq->pcq_producer == pcq->pcq_consumer
and all pointers in the ring buffer will be NULL. Now suppose that pcq_put()
is executed up to the point indicated below, after
the
producer = pcq->pcq_producer;
but before the attempt to write *producer, and then is interrupted by something
else:
/*
* Get our starting point, While we are doing this, it is
* imperative that pcq->pcq_base/pcq->pcq_limit not change
* in value. If you need to resize a pcq, init a new pcq
* with the right size and swap pointers to it.
*/
membar_consumer(); /* see updates to pcq_producer */
producer = pcq->pcq_producer;
for (;;) {
/*
* Preadvance so we reduce the window on updates.
*/
pcq_entry_t * const
new_producer = pcq_advance(pcq, producer);
====> stops here =====>
/*
* Try to fill an empty slot
*/
if (NULL == atomic_cas_ptr(producer, NULL, item)) {
At this point it has read pcq->pcq_producer, but hasn't done anything else.
Now suppose two things happen during that interrupt:
- Another thread calls pcq_put() (multiple producers are allowed according to
pcq(9)) and adds something else to the queue. When it completes this we'll
have:
pcq->pcq_producer == producer + 1;
*producer != NULL; /* A->B */
- The consumer now runs and calls pcq_get() to remove the thing just added.
When it finishes we'll have:
pcq->pcq_consumer == pcq->pcq_producer == producer + 1;
*producer == NULL; /* B->A */
- The original call to pcq_put() now resumes. It will execute the following
code:
if (NULL == atomic_cas_ptr(producer, NULL, item)) {
/*
* We need to use atomic_cas_ptr since another thread
* might have inserted between these two cas operations
* and we don't want to overwrite a producer that's
* more up-to-date.
*/
atomic_cas_ptr(&pcq->pcq_producer,
__UNVOLATILE(producer),
__UNVOLATILE(new_producer));
/*
* Tell them we were able to enqueue it.
*/
#ifndef __HAVE_ATOMIC_AS_MEMBAR
membar_producer();
#endif
return true;
}
The first CAS operation will succeed because *producer == NULL again. The
second CAS will fail (since pcq->pcq_producer != producer) but this is ignored.
It then returns success, but in fact the queue still looks empty from the
point of view of the consumer since
*pcq->pcq_consumer == NULL
at this point. The item queued might get consumed eventually, but it won't
until pcq->pcq_consumer has wrapped around the entire ring. The item is
orphaned until this happens.
The reason I began to look at this with suspicion is that I believe the call to
membar_consumer() at the start of pcq_put() is spurious (why would it want to
ensure that previous loads complete prior to the load of pcq->pcq_producer when
nothing it does depends on any previous loads, let alone their ordering with
respect to pcq->pcq_producer?) but the ABA issue is a way more serious problem,
if I'm not just confused.
>How-To-Repeat:
See above. I don't know how to arrange for that to happen in real life, but if
(when?) it does things will break.
>Fix:
I'm not sure, but I see papers all the time correcting bugs like this in
parallel algorithms which have been used for a decade. It might be better to
stick to well-vetted textbook algorithms rather than rolling one's own.
Home |
Main Index |
Thread Index |
Old Index