tech-kern archive
Re: Deadlock on fragmented memory?
[Cc'ing yamt@ and para@, in case they're not reading tech-kern@ right
now, since they know far more about allocators than I do.]
> Date: Sun, 22 Oct 2017 22:32:40 +0200
> From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
>
> With a pullup of kern_exec.c 1.448-1.449, to netbsd-6, we're still seeing
> hangs on vmem.
Welp. At least it's not an execargs hang!
I hypothesize that this may be an instance of a general problem with
chaining sleeping allocators: to allocate a foo when no existing block
has a free one, first allocate a new block of foos; then allocate a
foo within the block.  (Repeat recursively for a few iterations: 1KB
foos, 4KB pages of foos, 128KB blocks of pages, &c.)
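To make the shape of the problem concrete, here is a toy userland
model of that pattern (all names made up, pthreads standing in for the
kernel's condvars; not the actual pool(9)/vmem(9) code):

/*
 * Toy model of chained sleeping allocators.  alloc_foo() takes a foo
 * from an existing block if any block has one free; otherwise it
 * allocates a whole new block, which may sleep until enough "KVA" is
 * available for the block.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

#define FOOS_PER_BLOCK 4

struct block {
    struct block *b_next;
    unsigned b_nfree;               /* free foos in this block */
    bool b_used[FOOS_PER_BLOCK];
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kva_cv = PTHREAD_COND_INITIALIZER; /* "KVA freed" */
static struct block *blocks;        /* list of all blocks of foos */
static unsigned kva_free = 2;       /* free "KVA", in whole blocks */

static struct block *
alloc_block(void)       /* called with lock held; may sleep */
{
    struct block *b;

    /*
     * Sleep until there is room for a whole new block.  This is the
     * only wakeup the caller will ever get while it is stuck here.
     */
    while (kva_free == 0)
        pthread_cond_wait(&kva_cv, &lock);
    kva_free--;

    if ((b = calloc(1, sizeof(*b))) == NULL)
        abort();                    /* toy model: no error path */
    b->b_nfree = FOOS_PER_BLOCK;
    b->b_next = blocks;
    blocks = b;
    return b;
}

static void *
alloc_foo(void)
{
    struct block *b;
    unsigned i;

    pthread_mutex_lock(&lock);
    for (b = blocks; b != NULL; b = b->b_next)
        if (b->b_nfree != 0)
            break;
    if (b == NULL)                  /* every foo in every block in use */
        b = alloc_block();
    for (i = 0; b->b_used[i]; i++)
        continue;
    b->b_used[i] = true;
    b->b_nfree--;
    pthread_mutex_unlock(&lock);
    return &b->b_used[i];           /* stands in for the foo itself */
}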
- Suppose thread A tries to allocate a foo, and every foo in every
block allocated so far is currently in use. Thread A will proceed
to try to allocate a block of foos. If there's not enough KVA to
allocate a block of foos, thread A will sleep until there is.
- Suppose thread B comes along and frees a foo. That doesn't wake
thread A, because there's still not enough KVA to allocate a block
of foos. So thread A continues to hang -- forever, if KVA is too
fragmented.
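Continuing the toy model, the free path looks like this.  Freeing one
foo wakes nobody; only when a whole block empties out and its "KVA"
goes back does anyone get a signal, and then only threads sleeping in
alloc_block() for KVA, not a thread that merely wanted one foo:

static void
free_foo(void *p)
{
    bool *f = p;
    struct block *b, **bp;

    pthread_mutex_lock(&lock);
    for (b = blocks; b != NULL; b = b->b_next)
        if (f >= &b->b_used[0] && f < &b->b_used[FOOS_PER_BLOCK])
            break;
    /* (assume p really came from alloc_foo, so b != NULL) */
    *f = false;
    b->b_nfree++;
    if (b->b_nfree == FOOS_PER_BLOCK) {
        /* Whole block is free: unlink it and give back its KVA. */
        for (bp = &blocks; *bp != b; bp = &(*bp)->b_next)
            continue;
        *bp = b->b_next;
        free(b);
        kva_free++;
        pthread_cond_signal(&kva_cv);
    }
    pthread_mutex_unlock(&lock);
}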
Even if thread A eventually makes progress, every time this happens,
it will allocate a new block of foos instead of reusing a foo from an
existing block.
And if there's no bound on the number of threads waiting to allocate a
block of foos (as is the case, I think, with pools), then under bursts
of heavy load there may be lots of nearly empty foo blocks allocated
simultaneously, which makes fragmentation even worse.
Thread A _should_ make progress if a foo is freed up, but it doesn't:
we have no mechanism by which multiple different signals can cause a
thread to wake, short of sharing the condition variables for them and
restarting every cascade of blocking allocations from the top.
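For comparison, that alternative looks roughly like this in the toy
model (again hypothetical, a sketch rather than a proposal for the
real code): one shared condvar for every kind of free, and the blocked
path re-checks the whole chain from the top whenever it wakes:

static pthread_cond_t any_free_cv = PTHREAD_COND_INITIALIZER;

static void *
alloc_foo_restartable(void)
{
    struct block *b;
    unsigned i;

    pthread_mutex_lock(&lock);
    for (;;) {
        /* Did a foo come free since we last looked?  Take it. */
        for (b = blocks; b != NULL; b = b->b_next)
            if (b->b_nfree != 0)
                goto found;
        /* No.  Is there KVA for a new block?  Take that instead. */
        if (kva_free != 0) {
            kva_free--;
            if ((b = calloc(1, sizeof(*b))) == NULL)
                abort();
            b->b_nfree = FOOS_PER_BLOCK;
            b->b_next = blocks;
            blocks = b;
            goto found;
        }
        /* Neither.  Wait for *any* free at all, then retry. */
        pthread_cond_wait(&any_free_cv, &lock);
    }
found:
    for (i = 0; b->b_used[i]; i++)
        continue;
    b->b_used[i] = true;
    b->b_nfree--;
    pthread_mutex_unlock(&lock);
    return &b->b_used[i];
}

And free_foo() then has to pthread_cond_broadcast(&any_free_cv) on
every single free, whether or not any KVA came back, which is exactly
the restart-from-the-top cost alluded to above.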
This hang won't always happen:
- In the case of execargs buffers, this _won't_ happen (now) because
each execargs buffer is uvm_km_allocated one at a time, not in
blocks, so as long as the page daemon runs and there is an unused
execargs buffer, shrinking exec_pool will free enough KVA in
exec_map to allow a blocked uvm_km_alloc to continue and thereby
allow a blocked pool_get to continue.
- But in the case of pathbufs, they're 1024 bytes apiece, allocated in
4KB pool pages from kmem_va_arena; those 4KB allocations are served
by kmem_va_arena's qcache, whose pool pages are in turn 128KB blocks
allocated from kmem_va_arena proper, which come from kmem_arena.  And
there are no free 128KB regions left in kmem_arena, according to your
`show vmem', which (weakly) supports this hypothesis.
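Putting numbers on that, just from the sizes above (my arithmetic, not
anything measured):

/* Back-of-the-envelope arithmetic from the sizes quoted above. */
#define PATHBUF_SIZE  1024          /* one pathbuf */
#define POOLPAGE_SIZE 4096          /* pnbuf_cache pool page */
#define IMPORT_SIZE   (128 * 1024)  /* qcache pool page / kmem_arena import */

/*
 * 4 pathbufs per pool page and 32 pool pages per import, so one 128KB
 * contiguous stretch of kmem_arena can end up backing as many as 128
 * pathbufs; and kmem_arena only gets that KVA back in whole 128KB
 * chunks, once every one of them is free and both caching layers have
 * let go of the pages.
 */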
To really test this hypothesis, you also need to check either
(a) for pages of pathbufs with free pathbufs in pnbuf_cache, or
(b) for blocks with free pages in kmem_va_arena's qcache.
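Concretely, the check I have in mind is the computation below, on the
counters a pool keeps; the field names are from memory, so treat them
as approximate:

/*
 * Hypothetical helper for check (a): how many items are sitting free
 * in pages the pool has already allocated?  The inputs are meant to
 * be struct pool's pr_npages, pr_itemsperpage, and pr_nout (or
 * whatever the equivalent counters are called in netbsd-6 and in
 * vmstat -m output).  Items idling in the pool_cache CPU caches still
 * count as "out" from the pool's point of view, so this undercounts
 * what is really free.  The same computation on kmem_va_arena's 4KB
 * qcache pool answers check (b).
 */
static unsigned
idle_items(unsigned npages, unsigned itemsperpage, unsigned nout)
{
    return npages * itemsperpage - nout;
}

If idle_items() is nonzero for pnbuf_cache's pool while a thread is
stuck in pool_get on it, that's the lost wakeup in action.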
I'm a little puzzled about the call stack. By code inspection, it
seems the call stack should look like:
cv_wait(&kmem_arena->vm_cv, &kmem_arena->vm_lock)
vmem_xalloc(kmem_arena, #x20000, ...)
vmem_alloc(kmem_arena, #x20000, ...)
vmem_xalloc(kmem_va_arena, #x20000, ...)
vmem_alloc(kmem_va_arena, #x20000, ...)
qc_poolpage_alloc(...qc...)
pool_grow(...qc...)
* pool_get(...qc...)
* pool_cache_get_paddr(...qc...)
* vmem_alloc(kmem_va_arena, #x1000, ...)
* uvm_km_kmem_alloc(kmem_va_arena, #x1000, ...)
* pool_page_alloc(&pnbuf_cache->pc_pool, ...)
* pool_allocator_alloc(&pnbuf_cache->pc_pool, ...)
* pool_grow(&pnbuf_cache->pc_pool, ...)
pool_get(&pnbuf_cache->pc_pool, ...)
pool_cache_get_slow(pnbuf_cache->pc_cpus[curcpu()->ci_index], ...)
pool_cache_get_paddr(pnbuf_cache, ...)
pathbuf_create_raw
The starred lines do not seem to appear in your stack trace. Note
that immediately above pool_get in your stack trace, which presumably
passes &pnbuf_cache->pc_pool, is a call to pool_grow for a _different_
pool, presumably the one inside kmem_arena's qcache.