I thought this would be a good time to give an update on the status of the wrstuden-revivesa branch.

I have been testing it on i386 inside a VMware Fusion virtual machine with two CPUs exposed to the guest OS. As of this afternoon, it is passing all but one of the libpthread regression tests. Unfortunately the one we don't pass is resolv, the most complicated one.

A lot of the recent code quality stems from running a LOCKDEBUG DEBUG DIAGNOSTIC kernel. :-)

Things that are needed:

1) Figure out the one remaining bug. It has manifested itself as either a failed lwp_locked(l, spc->spc_mutex) assertion at line 279 of kern_runq.c, or an attempt to deliver an upcall where the event thread and the interrupted thread are the same one.

This latter issue is REALLY weird. Threads that have blocked in the kernel stop running when they wander into sa_unblock_userret(). They put themselves on a sleep queue (used to hold threads that are waiting for processing), and then they should never run again until they've been reused. To that end, the line of code after mi_switch() is lwp_exit(). "Being reused" involves a call to cpu_setfunc() which, as I understand it, resets the whole stack. So there is no way an lwp should run while it's on the sleepq. Yet I occasionally see a trap getting serviced on this lwp, which then tries to return to userland. That of course explodes.

I believe the problem is somewhere in my handling of sleep queues. I would appreciate suggestions.

The flow of events is:

Process makes itself an SA process. This causes an lwp to be created and added to the savp_lwpcache sleepqueue. Its wchan is "lwpcache", and its mutex is savp_mutex, the virtual processor's mutex.

At some point the existing thread blocks and we decide to generate a BLOCKED upcall. We wander into sa_switch() with the lwp locked. We lock the savp_mutex (carefully), and then do a hand-rolled sleepq_unsleep(l, false). sleepq_unsleep() degenerates into sleepq_remove().
Half of sleepq_remove() is done (in sa_getcachelwp(), the part that removes the thread from the queue), we do some work, then we do the rest of sleepq_remove() (the part that makes the new thread runnable). We then mi_switch() and go on our way.

We eventually wake up, and the kernel finishes doing whatever it needs to do. The thread eventually ends up in userret(), which notices that we blocked. We then make sure the "blessed" lwp will notice us and we enqueue ourselves on the savp_woken sleepq. Then we mi_switch() away. If we ever come back, we lwp_exit().

The blessed lwp eventually returns to userland, and in userret() notices it needs to generate upcalls. So it generates an upcall for the lwps on the savp_woken queue. The lwp that blocked is then put in the cache, and may get grabbed in the future when an lwp blocks.

So something's a little off somewhere in the queue manipulation, and we don't have asserts that notice it quickly.

Thoughts?

Take care,

Bill
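
One way to picture the queue discipline described above is a small userland toy model. This is only a sketch: the toy_* names, the two-field state and the transition rules are assumptions made for illustration, and none of them are the kernel's data structures or API. It encodes the rule that an lwp sitting on either sleep queue must never be running, and each transition refuses to proceed if it would break that rule, which is roughly the kind of check that might notice a queue-handling slip early.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userland model. All names are hypothetical, not kernel code. */
enum toy_queue {
	TOY_Q_NONE,	/* not on any sleep queue */
	TOY_Q_LWPCACHE,	/* stands in for the savp_lwpcache queue */
	TOY_Q_WOKEN	/* stands in for the savp_woken queue */
};

struct toy_lwp {
	enum toy_queue on_queue;	/* which queue holds us, if any */
	bool running;			/* true while on a CPU */
};

/* The invariant under discussion: a queued lwp must not be running. */
static bool
toy_lwp_consistent(const struct toy_lwp *l)
{
	return !(l->on_queue != TOY_Q_NONE && l->running);
}

/* First half of the split removal: take the lwp off the cache queue. */
static bool
toy_cache_dequeue(struct toy_lwp *l)
{
	if (l->on_queue != TOY_Q_LWPCACHE || l->running)
		return false;
	l->on_queue = TOY_Q_NONE;
	return true;
}

/* Second half: let the lwp run. Refused while it is still queued. */
static bool
toy_make_running(struct toy_lwp *l)
{
	if (l->on_queue != TOY_Q_NONE)
		return false;
	l->running = true;
	return true;
}

/* A thread that blocked parks itself on the woken queue and stops. */
static bool
toy_park_woken(struct toy_lwp *l)
{
	if (!l->running || l->on_queue != TOY_Q_NONE)
		return false;
	l->on_queue = TOY_Q_WOKEN;
	l->running = false;
	return true;
}

/* After the upcall is generated, the parked lwp goes back to the cache. */
static bool
toy_recycle(struct toy_lwp *l)
{
	if (l->on_queue != TOY_Q_WOKEN || l->running)
		return false;
	l->on_queue = TOY_Q_LWPCACHE;
	return true;
}
```

In this model, trying toy_make_running() on an lwp that is still on a queue simply fails, which corresponds to the "should never run while it's on the sleepq" expectation. The real code has locking and scheduler state that this sketch ignores entirely.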