kern/56952: UVM deadlock in madvise vs. munmap
>Number: 56952
>Category: kern
>Synopsis: UVM deadlock in madvise vs. munmap
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Aug 03 20:20:00 +0000 2022
>Originator: David A. Holland
>Release: NetBSD 9.99.97 (20220602)
>Organization:
>Environment:
System: NetBSD valkyrie 9.99.97 NetBSD 9.99.97 (VALKYRIE_LOCKDEBUG) #1: Wed Jun 22 23:56:00 EDT 2022 dholland@valkyrie:/usr/src/sys/arch/amd64/compile/VALKYRIE_LOCKDEBUG amd64
Architecture: x86_64
Machine: amd64
>Description:
I have hit a deadlock a few times while running some database stress
tests, and today I caught it with UVM_PAGE_TRKOWN enabled.
The dead state is as follows:
Thread 1 is in madvise(MADV_DONTNEED) and is holding a read lock on
the process's map. It is waiting in putpages to chuck one of the pages.
Thread 2 is in uvm_fault_internal; it is holding the page and trying
to get a read lock on the map.
Thread 3 is in munmap; it is waiting for a write lock on the map, and
the queued writer keeps thread 2 from taking its read lock, which
converts this into a deadlock.
(This is all in one process.)
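For illustration only, here is a small userland pthreads analogue of that
wait cycle. All of the names (map_lock, page_busy, the thread functions)
are hypothetical stand-ins, not UVM code. Whether it actually wedges
depends on the scheduler and on whether the rwlock blocks new readers
while a writer is queued, which is the property that makes the real
scenario deadlock; if it does wedge, the program simply hangs.

/*
 * Toy stand-ins: an rwlock for the map, a "busy" flag for the page
 * that the faulting thread allocated.  Illustrative sketch only.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t map_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  page_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   page_cv  = PTHREAD_COND_INITIALIZER;
static bool page_busy;

/* Thread 1: madvise(MADV_DONTNEED) -- holds the map read lock, then
 * waits in the flush for the busy page. */
static void *
madvise_thread(void *arg)
{
	usleep(100000);				/* bias ordering: after page is busy */
	pthread_rwlock_rdlock(&map_lock);
	pthread_mutex_lock(&page_mtx);
	while (page_busy)
		pthread_cond_wait(&page_cv, &page_mtx);
	pthread_mutex_unlock(&page_mtx);
	pthread_rwlock_unlock(&map_lock);
	return NULL;
}

/* Thread 2: page fault -- owns the busy page, then wants the map read
 * lock back to enter the mapping into the pmap. */
static void *
fault_thread(void *arg)
{
	pthread_mutex_lock(&page_mtx);
	page_busy = true;
	pthread_mutex_unlock(&page_mtx);

	usleep(300000);				/* bias ordering: after the writer queues */
	pthread_rwlock_rdlock(&map_lock);	/* may block behind the queued writer */
	pthread_rwlock_unlock(&map_lock);

	pthread_mutex_lock(&page_mtx);
	page_busy = false;
	pthread_cond_broadcast(&page_cv);
	pthread_mutex_unlock(&page_mtx);
	return NULL;
}

/* Thread 3: munmap -- wants the map write lock, waits behind thread 1's
 * read hold and, if the implementation prefers queued writers, keeps
 * thread 2 from getting its read lock. */
static void *
munmap_thread(void *arg)
{
	usleep(200000);				/* bias ordering: after thread 1's rdlock */
	pthread_rwlock_wrlock(&map_lock);
	pthread_rwlock_unlock(&map_lock);
	return NULL;
}

int
main(void)
{
	pthread_t t1, t2, t3;

	pthread_create(&t2, NULL, fault_thread, NULL);
	pthread_create(&t1, NULL, madvise_thread, NULL);
	pthread_create(&t3, NULL, munmap_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_join(t3, NULL);
	printf("no deadlock this run\n");
	return 0;
}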
Taylor constructed the following narrative for how it got this way
(any transcription errors are my fault):
<Riastradh> Presumably you have an object foo which is mapped at
0xdeadbee000 in the address space
<Riastradh> 1. Someone tried to read from page 0xdeadbef000, say,
which is the range [0x1000, 0x2000) in foo.
<Riastradh> They consulted the map, which resolved the address to that range in foo.
<Riastradh> They released the map lock, then allocated a page and
punched it into foo, and they want to reacquire the map lock to
punch it into the pmap.
<Riastradh> 2. Someone else tried to madvise(MADV_DONTNEED) some
range, say [0xdeadbee000, 0xdeadbf6000), in foo, and chuck all the
pages.
<Riastradh> Took the map read lock to see that 0xdeadbef000 is mapped to
foo@0x1000, entered genfs_io_chuck_all_the_pages or whatever, and
then started waiting for the page that (1) allocated for
foo@0x1000.
<Riastradh> Except I got the order wrong again and this last player
actually started first, but whatever.
<Riastradh> 3. At the same time, someone else tried to unmap
0xdeadbef000, which requires taking a _write_ lock.
<Riastradh> which threw a wrench in the whole thing
<Riastradh> So, one obvious possibility is: make uvm_map_clean drop
the map lock while doing genfs_io_chuck_all_the_pages.
<Riastradh> (pgo_put)
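Under the same toy model as the sketch above (hypothetical names, not a
patch against uvm_map_clean), the suggested change corresponds to thread 1
releasing the map read lock before it waits on the busy page, which takes
the map lock out of the cycle:

/* Variant of madvise_thread from the sketch above: look up the range
 * under the map read lock, then drop the lock before the blocking
 * flush, so the queued munmap writer and the faulting reader can both
 * make progress.  A real change would have to revalidate or restart
 * after re-taking the lock, since the map may have changed meanwhile. */
static void *
madvise_thread_unlocked_flush(void *arg)
{
	pthread_rwlock_rdlock(&map_lock);
	/* ... look up which object/offsets cover the range ... */
	pthread_rwlock_unlock(&map_lock);	/* drop before the blocking flush */

	pthread_mutex_lock(&page_mtx);
	while (page_busy)			/* the flush can now wait safely */
		pthread_cond_wait(&page_cv, &page_mtx);
	pthread_mutex_unlock(&page_mtx);
	return NULL;
}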
>How-To-Repeat:
>Fix:
Oof.