revivesa status

To: tech-kern%NetBSD.org@localhost
Subject: revivesa status
From: Bill Stouder-Studenmund <wrstuden%netbsd.org@localhost>
Date: Tue, 1 Jul 2008 16:29:42 -0700

I thought this would be a good time to give an update on the status of the 
wrstuden-revivesa branch.

I have been testing it on i386 inside a VMware Fusion virtual machine with 
two CPUs assigned to the guest OS.

As of this afternoon, it is passing all but one of the libpthread 
regression tests. Unfortunately, the one we don't pass is resolv, the most 
complicated one. A lot of the recent improvement in code quality stems 
from running a LOCKDEBUG DEBUG DIAGNOSTIC kernel. :-)

Things that are needed:

1) Figure out the one remaining bug. It has manifested itself as either a 
failed lwp_locked(l, spc->spc_mutex) assertion at line 279 of kern_runq.c, 
or an attempt to deliver an upcall where the event thread and the 
interrupted thread are the same one.

This latter issue is REALLY weird. Threads that have blocked in the kernel
stop running when they wander into sa_unblock_userret(). They put
themselves on a sleep queue (used to hold threads that are waiting for
processing), and then they should never run again until they've been
reused. To that end, the line of code after mi_switch() is lwp_exit().
"Being reused" involves a call to cpu_setfunc() which, as I understand it,
resets the whole stack. So there is no way an lwp should run while it's on
the sleepq. Yet I occasionally see a trap getting serviced on this lwp,
which then tries to return to userland. That of course explodes.

I believe the problem is somewhere in my handling of sleep queues. I would 
appreciate suggestions.

The flow of events is:

A process makes itself an SA process. This causes an lwp to be created and 
added to the savp_lwpcache sleepqueue. Its wchan is "lwpcache", and its 
mutex is savp_mutex, the virtual processor's mutex.

At some point the existing thread blocks and we decide to generate a 
BLOCKED upcall. We wander into sa_switch() with the lwp locked. We lock 
the savp_mutex (carefully), and then do a hand-rolled sleepq_unsleep(l, 
false). sleepq_unsleep() degenerates into sleepq_remove(). Half of 
sleepq_remove() is done (in sa_getcachelwp(), the part that removes the 
thread from the queue), we do some work, then we do the rest of 
sleepq_remove() (the part that makes the new thread runnable). We then 
mi_switch() and go on our way.
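
The two-step removal described above can be modelled in user space. What 
follows is a minimal sketch under stated assumptions, not the branch's 
code: every toy_ name is hypothetical, the queue is a plain LIFO list, and 
all locking (savp_mutex, the lwp lock) is left out. It only shows the 
window between the dequeue half and the make-runnable half, and where 
asserts could guard it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; these are NOT the kernel's types or functions. */
enum toy_state { TOY_SLEEPING, TOY_DEQUEUED, TOY_RUNNABLE };

struct toy_lwp {
	enum toy_state state;
	struct toy_lwp *next;	/* link within the toy sleep queue */
};

struct toy_sleepq {
	struct toy_lwp *head;
};

/* Park an lwp on the queue (pushed at the head, so the queue is LIFO). */
void
toy_sleepq_enqueue(struct toy_sleepq *q, struct toy_lwp *l)
{
	l->state = TOY_SLEEPING;
	l->next = q->head;
	q->head = l;
}

/*
 * First half of the removal: take an lwp off the queue.
 * The lwp is no longer findable on the queue, but not yet runnable.
 */
struct toy_lwp *
toy_getcachelwp(struct toy_sleepq *q)
{
	struct toy_lwp *l = q->head;

	if (l == NULL)
		return NULL;
	assert(l->state == TOY_SLEEPING);
	q->head = l->next;
	l->next = NULL;
	l->state = TOY_DEQUEUED;
	return l;
}

/*
 * Second half of the removal: make the lwp runnable.
 * Only legal for an lwp that went through the first half.
 */
void
toy_make_runnable(struct toy_lwp *l)
{
	assert(l->state == TOY_DEQUEUED);
	l->state = TOY_RUNNABLE;
}
```

In this model, anything that made an lwp runnable without going through 
toy_getcachelwp() first would trip the assert in toy_make_runnable() 
immediately, rather than surfacing later as a stray trap.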

We eventually wake up, and the kernel finishes doing whatever it needs to 
do. The thread eventually ends up in userret(), which notices that we 
blocked. We then make sure the "blessed" lwp will notice us and we enqueue 
ourselves on the savp_woken sleepq. Then we mi_switch() away. If we ever 
come back, we lwp_exit().

The blessed lwp eventually returns to userland, and in userret() notices 
it needs to generate upcalls. So it generates an upcall for the lwps on 
the savp_woken queue. The lwp that blocked is then put in the cache, and 
may get grabbed in the future when an lwp blocks.
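
The lifecycle above can be written down as a small state machine. This is 
a simplified sketch, not code from the branch: the phase names and toy_ 
functions are hypothetical, and the time an lwp spends finishing its 
kernel work between waking up and reaching userret() is folded into the 
BLOCKED phase.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phases of an lwp in the flow described above. */
enum toy_lwp_phase {
	TOY_CACHED,	/* parked on the savp_lwpcache queue */
	TOY_RUNNING,	/* pulled from the cache and running */
	TOY_BLOCKED,	/* blocked in the kernel, or finishing kernel work */
	TOY_WOKEN	/* parked on the savp_woken queue */
};

/* Returns true if the move from one phase to the next fits the flow. */
bool
toy_transition_ok(enum toy_lwp_phase from, enum toy_lwp_phase to)
{
	switch (from) {
	case TOY_CACHED:
		return to == TOY_RUNNING;
	case TOY_RUNNING:
		return to == TOY_BLOCKED;
	case TOY_BLOCKED:
		return to == TOY_WOKEN;
	case TOY_WOKEN:
		return to == TOY_CACHED;
	default:
		return false;
	}
}

/* Advance an lwp's phase, asserting that the move is a legal one. */
void
toy_advance(enum toy_lwp_phase *phase, enum toy_lwp_phase to)
{
	assert(toy_transition_ok(*phase, to));
	*phase = to;
}
```

The failure described earlier corresponds to a move from TOY_WOKEN 
straight to TOY_RUNNING, which this table rejects.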

So something's a little off somewhere in the queue manipulation, and we 
don't have asserts that notice it quickly. Thoughts?
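
As a starting point for such asserts, the two symptoms can be phrased as 
cheap consistency checks. The sketch below is a user-space illustration 
only: the toy_ types and functions are hypothetical stand-ins, not the 
kernel's struct lwp or upcall structures, and where such checks would 
actually go in the kernel is an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; these are NOT the kernel's data structures. */
struct toy_lwp {
	const char *wchan;	/* non-NULL while parked on a sleep queue */
	bool on_cpu;		/* true while the lwp is executing */
};

struct toy_upcall {
	const struct toy_lwp *event;		/* lwp the upcall is about */
	const struct toy_lwp *interrupted;	/* lwp that was interrupted */
};

/* A parked lwp must never be executing at the same time. */
bool
toy_lwp_consistent(const struct toy_lwp *l)
{
	return !(l->wchan != NULL && l->on_cpu);
}

/* When an upcall names an event lwp, it must differ from the interrupted one. */
bool
toy_upcall_consistent(const struct toy_upcall *u)
{
	return u->event == NULL || u->event != u->interrupted;
}

/* Check intended for a choke point such as the return-to-user path. */
void
toy_userret_check(const struct toy_lwp *l)
{
	assert(toy_lwp_consistent(l));
}
```

Placing a check like toy_userret_check() on every path back to userland 
would flag a parked lwp the moment it runs, instead of leaving the damage 
to show up later.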

Take care,

Bill

