tech-kern archive
Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
> Date: Mon, 17 Jul 2023 12:57:42 +0900
> From: PHO <pho%cielonegro.org@localhost>
>
> While further testing my DRM/KMS vmwgfx driver patches by playing
> games/minetest from pkgsrc, I experienced inexplicable kernel panics on
> this code:
>
> https://github.com/depressed-pho/netbsd-src/blob/91daa67f17222da355d3fddd6fa849c786d9c545/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c#L289
A note about one of the problems there:
spin_unlock(f->lock);
ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
spin_lock(f->lock);
#if defined(__NetBSD__)
/* This is probably an upstream bug: there is a time window between
* the call of vmw_fence_obj_signaled() above, and this
* dma_fence_add_callback(). If the fence gets signaled during it,
* dma_fence_add_callback() returns -ENOENT, which is really not an
* error condition. By the way, why the heck does dma_fence work in
* this way? If a callback is being added but it lost the race, why
* not just call it immediately as if it were just signaled?
*/
Not an upstream bug -- I introduced this bug when I patched the code,
which reached into the guts of what should be an opaque data structure
to modify it directly, to use dma_fence_add_callback instead.
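For reference, the Linux convention is that dma_fence_add_callback()
returns -ENOENT when the fence is already signaled, and the caller
folds that into success rather than treating it as an error.  Here is
a minimal sketch of that pattern (add_wait_cb is a hypothetical helper
of mine, not the actual patch; struct vmwgfx_wait_cb and the
vmwgfx_wait_cb callback are the names from the quoted file):

	static int
	add_wait_cb(struct dma_fence *f, struct vmwgfx_wait_cb *cb)
	{
		int ret;

		ret = dma_fence_add_callback(f, &cb->base, vmwgfx_wait_cb);
		if (ret == -ENOENT) {
			/*
			 * The fence was signaled between the earlier
			 * vmw_fence_obj_signaled() check and this call.
			 * The callback will never run, but the wait is
			 * satisfied -- success, not an error.
			 */
			ret = 0;
		}
		return ret;
	}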
Need to look at the diff from upstream, not just the #ifdefs.  Usually
I use #ifdef __NetBSD__ to mark NetBSDisms separately from Linuxisms,
and just patch the code outright when it can use a common API that
isn't any one OSism.
In this case I don't even remember why I left any #ifdefs; I was
probably just working fast to make progress on a large code base, and
may have left the #ifdefs in for visual reference while editing the
code and then forgotten to remove them.  Removing them would also make
it possible to simplify some of the lock/unlock cycles.
> cv_destroy(&cv); // <-- Panics!
>
> It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but not
> always. The panic seems to happen when cv_timedwait_sig() exits due to
> the timeout expiring before it gets signaled.
Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more
frequent panic than KASSERT(!cv_has_waiters(cv))?
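FWIW, here's the kind of interleaving that could make that KASSERT
fire intermittently when the timeout wins the race -- this is a sketch
of a hypothesis, not something confirmed from a dump, and it assumes
the fence callback does a cv_broadcast on the same cv:

	/* waiter */                        /* fence signal path */
	error = cv_timedwait_sig(&cv, ...);
	/* timeout expires, not signaled */
	                                    vmwgfx_wait_cb(f, &cb.base)
	                                        -> cv_broadcast(&cv)  /* in flight */
	cv_destroy(&cv);
	    KASSERT(!cv_has_waiters(&cv))   <-- can fire here

If that's what's happening, one way to close the window might be to
unregister the callback before destroying the cv --
dma_fence_remove_callback() returns true iff the callback hadn't been
called yet, and (as in Linux, where callbacks run under the fence
lock) a lock/unlock cycle waits out a callback already in flight:

	if (!dma_fence_remove_callback(f, &cb.base)) {
		/*
		 * The callback already ran or is running.  Callbacks
		 * are invoked under the fence lock, so retaking it
		 * ensures the callback has finished touching cv.
		 */
		spin_lock(f->lock);
		spin_unlock(f->lock);
	}
	cv_destroy(&cv);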
What exactly is the panic you see, and what evidence do you have when
you see it?  A stack trace, or `print cb' from gdb on a crash dump?
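(For example -- assuming the usual defaults, with savecore(8) writing
dumps to /var/crash and a debug kernel netbsd.gdb from your build tree;
adjust paths to taste:

	# gdb /path/to/netbsd.gdb
	(gdb) target kvm /var/crash/netbsd.0.core
	(gdb) bt
	(gdb) frame <N>       # pick the frame where cb is in scope
	(gdb) print cb

At the ddb prompt at panic time, `bt' alone gives the stack trace.)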