Port-vax archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
the 'qt' network driver, a pool cache corruption error, and a proposed fix
Hi all,
The recent thread about dhcpd not working under NetBSD/vax on SimH has
reminded me of another network-related problem which I'd found a while
ago. I'd like to get some feedback on whether the observations, my
understanding of what is happening, the conclusions and the proposed fix
are reasonable.
The problem was found in NetBSD 9.99.69, running under SimH V4.0-0 using
the 'XQ' network adapter.
To trigger the problem required a kernel with the DEBUG and DIAGNOSTIC
options set. The assertions in sys/kern/subr_pool.c:pool_get() and
sys/kern/subr_pool.c:pool_cache_get_paddr() checking to see if a pool
can be accessed from interrupt context also needed to be disabled. This
seems reasonably safe as these assertions are not checked if the DEBUG
option is not set. (Further investigation to understand why the
assertions are failing has not been done.)
(The problem is believed to still be present in NetBSD 9.99.92 but
masked by a change in version 1.276 of sys/kern/subr_pool.c to
skip redzone on pools with the allocation (including all overhead)
on anything greater than half the pool pagesize.
)
The problem is this:
With the qt0 interface enabled, after a small number of frames have been
received, the following message is seen on the console
[ 17.3100030] qt0: srr=01777777777777777710002<NXM>
Sometimes the kernel panics:
panic: pool_redzone_check: [mclpl] 0x08 != 0xb9
Stack traceback :
0x94303d3c: vpanic+0x189(0x8036e5bd,0x94303ddc)
0x94303d5c: snprintf+0x0(0x8036e5bd,0x8036f46b,0x80373652,0x8,0xb9)
0x94303d90: pool_redzone_check.part.10+0x72(0x8ff39268,0x8f89f240)
0x94303df0: pool_cache_put_paddr+0x20(0x8ff39268,0x8f89f240,0xffffffff)
0x94303e14: m_ext_free+0x15b(0x8f8b2334)
0x94303e48: m_free+0x60(0x8f8b2334)
0x94303e70: m_freem.part.8+0x15(0x8f8b2334)
0x94303e98: m_freem+0x11(0x8f8b2334)
0x94303ebc: ether_input+0x370(0x80dc3104,0x8f8b2334)
0x94303ee4: if_percpuq_softint+0x80(0x8ffbde20)
0x94303f20: softint_dispatch+0xaf(0x8fec4100,0xc)
0x94303f68: softint_process+0xa(0)
Further debugging narrowed the problem down to the receive ring buffer.
This buffer is implemented as a set of mbufs, allocated when the driver
is initialised (in if_ubaminit() called from qtinit()). The mbufs use
external buffers from the "mclpl" pool cache. The flags passed to
pool_cache_init() to create the "mclpl" cache are such that the pool
headers may be placed on the same pages as objects in the pool, and so
the offsets within a page of the mbufs' external buffers end up being
inconsistent - all aligned to COHERENCY_UNIT, but not necessarily all
the same.
if_ubaminit() also initialises page table entries for DMA and the buffer
description list for the receive ring buffer; the resulting bus
addresses in the buffer description list will also have differing
offsets from the beginning of a page.
When a frame is received, a filled buffer is removed from the receive
ring buffer, and a new one is substituted, by updating the corresponding
page table entries to refer to the new buffer. The entries in the
buffer description lists are not changed - the hardware still uses the
same bus address, which now might not have the right offset for the new
page, resulting in a subsequent frame being written outside the area
allocated for it - over the pool header, or into the red zone, or onto
an invalid page.
The fix was to change the arguments passed to pool_cache_init() to keep
the COHERENCY_UNIT alignment but set the PR_NOTOUCH flag to keep the the
pool headers separate from the objects in the pool.
This seems to work - but is this an appropriate solution? Is there a
better one?
thanks
kalvis
diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index 836aa621c43b..185dd928b863 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -184,7 +184,7 @@ mbinit(void)
NULL, IPL_VM, mb_ctor, NULL, NULL);
KASSERT(mb_cache != NULL);
- mcl_cache = pool_cache_init(mclbytes, COHERENCY_UNIT, 0, 0, "mclpl",
+ mcl_cache = pool_cache_init(mclbytes, COHERENCY_UNIT, 0,
PR_NOTOUCH, "mclpl",
NULL, IPL_VM, NULL, NULL, NULL);
KASSERT(mcl_cache != NULL);
Home |
Main Index |
Thread Index |
Old Index