Port-mips archive
port-mips/55062 (Failed assertion in pmap_md_tlb_check_entry())
I filed this bug yesterday, although I've seen it a few times in the last couple of weeks running -current on my Qube2. Note, the Cobalt GENERIC kernel has DEBUG turned on by default, and this assertion is in an #ifdef DEBUG code block.
The panic always looks like this:
kernel diagnostic assertion "pte == xpte" failed: file "../../../../arch/mips/mips/pmap_machdep.c", line 871 pmap=0x80641be4 va=0xc3018000 asid=0: TLB pte (0x7e01f) != real pte (0x1/0x1) @ 0x80712
It's not always exactly the same, but it follows a pattern -- the TLB pte has a real value, and the real pte (the PTE in the software copy of the page tables kept in the pmap structure) has the "kernel invalid PTE" value (only the G bit set).
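To make the two values in the panic message easier to read, here is a small decoder sketch. It assumes the usual MIPS3-style EntryLo layout (G in bit 0, V in bit 1, D in bit 2, a 3-bit cache attribute, PFN from bit 6 up). The macro and function names are made up for illustration and are not the ones in the NetBSD headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MIPS3-style EntryLo layout; names are hypothetical. */
#define PTE_G          0x01u   /* global: match regardless of ASID */
#define PTE_V          0x02u   /* valid */
#define PTE_D          0x04u   /* dirty, i.e. writable */
#define PTE_CCA_SHIFT  3
#define PTE_CCA_MASK   0x7u    /* cache coherency attribute */
#define PTE_PFN_SHIFT  6

struct pte_fields {
	bool     g, v, d;
	unsigned cca;
	uint32_t pfn;
};

/* Split a raw PTE value into its fields. */
static struct pte_fields
pte_decode(uint32_t pte)
{
	struct pte_fields f;

	f.g   = (pte & PTE_G) != 0;
	f.v   = (pte & PTE_V) != 0;
	f.d   = (pte & PTE_D) != 0;
	f.cca = (pte >> PTE_CCA_SHIFT) & PTE_CCA_MASK;
	f.pfn = pte >> PTE_PFN_SHIFT;
	return f;
}

/* The "kernel invalid PTE" described above: only the G bit is set. */
static bool
pte_is_kernel_invalid(uint32_t pte)
{
	return pte == PTE_G;
}
```

Under that assumed layout, 0x7e01f decodes as a global, valid, writable mapping, while 0x1 is the kernel-invalid value. So the TLB still holds a live entry for a VA that the software page table says is gone.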
The assertion is called from a sanity check routine called by pmap_update(). This sanity check walks the hardware TLB to ensure that any valid hardware TLB entries match what the software copy in the pmap says.
So my initial thought was that a TLB invalidation was missing, so I copied the check to the places that remove mappings to see if I could narrow down which one it was.
Turns out it was pmap_kremove(), so I saved the start and end VA that was being removed in a couple of globals so I could examine them from DDB when the assertion tripped. I've been able to reproduce it fairly reliably by doing a "cvs co" of the NetBSD source tree, and so:
db> x/x pmap_kremove_sva
netbsd:pmap_kremove_sva: c100c000
db> x/x pmap_kremove_eva
netbsd:pmap_kremove_eva: c101c000
db> x/s panicstr
netbsd:panicstr: 4\334g\200kernel diagnostic assertion "pte == xpte" failed: file "../../../../arch/mips/mips/pmap_machdep.c", line 871 pmap=0x80641d24 va=0xc2be2000 asid=0: TLB pte (0x3061f) != real pte (0x1/0x1) @ 0x807117c4
db> bt
0x807bdcd0: cpu_Debugger+4 (1,2d4,8067ddbc,803b931c) ra 803b9598 sz 0
0x807bdcd0: vpanic+15c (1,2d4,8067ddbc,803b931c) ra 804cf03c sz 48
0x807bdd00: kern_assert+3c (1,805540ac,80556114,80555c20) ra 800177d0 sz 32
0x807bdd20: pmap_md_tlb_check_entry+108 (1,805540ac,80556114,80555c20) ra 800178b8 sz 80
0x807bdd70: tlb_walk+a0 (1,805540ac,80556114,80555c20) ra 8001cf3c sz 56
0x807bdda8: pmap_tlb_check+50 (1,805540ac,80556114,80555c20) ra 800195c4 sz 32
0x807bddc8: pmap_kremove+6c (1,805540ac,80556114,80555c20) ra 80325778 sz 40
0x807bddf0: uvm_pagermapout+24 (1,805540ac,80556114,80555c20) ra 803260cc sz 48
0x807bde20: uvm_aio_aiodone+a4 (1,805540ac,80556114,80555c20) ra 80408d90 sz 64
0x807bde60: biodone2+7c (1,805540ac,80556114,80555c20) ra 80408f2c sz 40
0x807bde88: biointr+a0 (1,805540ac,80556114,80555c20) ra 8037f3e0 sz 64
0x807bdec8: softint_dispatch+11c (1,805540ac,80556114,80555c20) ra 80001288 sz 128
0x807bdf48: softint_fast_dispatch+78 (1,805540ac,80556114,80555c20) ra 0 sz 24
User-level: pid 0.4
db>
So, what I glean from this is:
1- A 64K region is being unmapped -- probably a UBC window?
2- uvm_pagermapout() calls pmap_update() immediately after pmap_kremove(), which would have also tripped the assertion.
3- The failed VA wasn't in the region that is being unmapped by pmap_kremove().
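The arithmetic behind points 1 and 3 can be checked with a few lines of C, using the values from the DDB session above. The helper names are hypothetical, and the end address is treated as exclusive, which is an assumption about how the range is expressed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size in bytes of the half-open range [sva, eva). */
static uint32_t
range_size(uint32_t sva, uint32_t eva)
{
	return eva - sva;
}

/* True if va lies inside the half-open range [sva, eva). */
static bool
va_in_range(uint32_t va, uint32_t sva, uint32_t eva)
{
	return va >= sva && va < eva;
}
```

With sva = 0xc100c000 and eva = 0xc101c000, range_size() gives 0x10000 (64K), and va_in_range() is false for the failing VA 0xc2be2000.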
So, if I'm doing this check IMMEDIATELY after invalidating the mapping, what could be the source of the inconsistency when removing a mapping?
Removing a mapping is actually a two-step process:
1- Invalidate the entry in the software copy of the page tables.
2- Invalidate the hardware TLB entry for that VA+ASID.
It's implemented in pmap_pte_remove():
	pmap_md_tlb_miss_lock_enter();
	pte_set(ptep, npte);
	if (__predict_true(!(pmap->pm_flags & PMAP_DEFERRED_ACTIVATE))) {
		/*
		 * Flush the TLB for the given address.
		 */
		pmap_tlb_invalidate_addr(pmap, sva);
	}
	pmap_md_tlb_miss_lock_exit();
Note the calls to pmap_md_tlb_miss_lock_enter() / pmap_md_tlb_miss_lock_exit(). These are there as hooks to handle CPUs like some powerpc booke CPUs that share a single hardware TLB among multiple logical CPUs. Those calls are no-ops on MIPS (they are #define'd away as nothing).
On a MIPS system, there is no need to put a lock around that short window because it doesn't matter if it's interrupted ... the API contract is that the mapping isn't invalidated until pmap_kremove() returns...
BUT...
...note the stack trace... in every case I've hit, it's been from a (soft)intr context.
I'm pretty sure what's happening is that, when the cosmic rays align correctly, a disk i/o completion interrupt comes in between pte_set() and pmap_tlb_invalidate_addr() (or at least before the kernel-invalid entry is actually written to the hardware), which causes the TLB consistency check called from that (soft)intr context to fail.
If pmap_md_tlb_miss_lock_enter() went to splvm(), that race window would close.
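To illustrate the suspected window, here is a tiny userland model. It is not kernel code: every name in it is invented, the "interrupt" is just a function call placed at the worst possible moment, and the intr_blocked flag only stands in for what raising the IPL with splvm() would do. It also assumes the V bit is 0x2, as in the MIPS3-style layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_INVALID_PTE 0x1u   /* only the G bit set */
#define PTE_VALID          0x2u   /* assumed position of the V bit */

struct model {
	uint32_t soft_pte;      /* software page-table entry */
	uint32_t tlb_pte;       /* hardware TLB entry for the same VA */
	bool     intr_blocked;  /* stand-in for a raised IPL */
	bool     intr_pending;  /* interrupt arrived while blocked */
	bool     check_failed;  /* consistency check tripped */
};

/* A valid TLB entry must match the software PTE. */
static bool
tlb_consistent(const struct model *m)
{
	if ((m->tlb_pte & PTE_VALID) == 0)
		return true;
	return m->tlb_pte == m->soft_pte;
}

/* The (soft)interrupt handler runs the consistency check. */
static void
run_softint(struct model *m)
{
	if (!tlb_consistent(m))
		m->check_failed = true;
}

/* Deliver an interrupt now, or defer it if interrupts are blocked. */
static void
deliver_interrupt(struct model *m)
{
	if (m->intr_blocked)
		m->intr_pending = true;
	else
		run_softint(m);
}

static void
unblock_interrupts(struct model *m)
{
	m->intr_blocked = false;
	if (m->intr_pending) {
		m->intr_pending = false;
		run_softint(m);
	}
}

/* Remove a mapping; returns true if the check tripped along the way. */
static bool
remove_mapping(uint32_t mapped_pte, bool raise_ipl)
{
	struct model m = { .soft_pte = mapped_pte, .tlb_pte = mapped_pte };

	if (raise_ipl)
		m.intr_blocked = true;
	m.soft_pte = KERNEL_INVALID_PTE;   /* step 1: software copy */
	deliver_interrupt(&m);             /* worst-case timing */
	m.tlb_pte = KERNEL_INVALID_PTE;    /* step 2: hardware TLB */
	if (raise_ipl)
		unblock_interrupts(&m);
	return m.check_failed;
}
```

In this model, remove_mapping(0x7e01f, false) reports a failed check, while remove_mapping(0x7e01f, true) does not, because the deferred handler only runs once both copies agree. The model only shows that the ordering is consistent with the panic. It says nothing about whether splvm() is the right level in the real kernel.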
Any other thoughts?
-- thorpej