the 'qt' network driver, a pool cache corruption error, and a proposed fix

To: port-vax%NetBSD.org@localhost
Subject: the 'qt' network driver, a pool cache corruption error, and a proposed fix
From: Kalvis Duckmanton <kalvisd%gmail.com@localhost>
Date: Fri, 17 Dec 2021 17:15:48 +1100

Hi all,

The recent thread about dhcpd not working under NetBSD/vax on SimH has reminded me of another network-related problem which I'd found a while ago. I'd like to get some feedback on whether the observations, my understanding of what is happening, the conclusions and the proposed fix are reasonable.

The problem was found in NetBSD 9.99.69, running under SimH V4.0-0 using the 'XQ' network adapter.

Triggering the problem required a kernel with the DEBUG and DIAGNOSTIC options set. The assertions in sys/kern/subr_pool.c:pool_get() and sys/kern/subr_pool.c:pool_cache_get_paddr() that check whether a pool can be accessed from interrupt context also needed to be disabled. This seems reasonably safe as these assertions are not checked if the DEBUG option is not set. (I have not yet investigated why the assertions fail.)

(The problem is believed to still be present in NetBSD 9.99.92 but masked by a change in version 1.276 of sys/kern/subr_pool.c to

    skip redzone on pools with the allocation (including all overhead) on anything greater than half the pool pagesize.

)

The problem is this:

With the qt0 interface enabled, after a small number of frames have been received, the following message is seen on the console:


[  17.3100030] qt0: srr=01777777777777777710002<NXM>

Sometimes the kernel panics:

panic: pool_redzone_check: [mclpl] 0x08 != 0xb9
Stack traceback :
0x94303d3c: vpanic+0x189(0x8036e5bd,0x94303ddc)
0x94303d5c: snprintf+0x0(0x8036e5bd,0x8036f46b,0x80373652,0x8,0xb9)
0x94303d90: pool_redzone_check.part.10+0x72(0x8ff39268,0x8f89f240)
0x94303df0: pool_cache_put_paddr+0x20(0x8ff39268,0x8f89f240,0xffffffff)
0x94303e14: m_ext_free+0x15b(0x8f8b2334)
0x94303e48: m_free+0x60(0x8f8b2334)
0x94303e70: m_freem.part.8+0x15(0x8f8b2334)
0x94303e98: m_freem+0x11(0x8f8b2334)
0x94303ebc: ether_input+0x370(0x80dc3104,0x8f8b2334)
0x94303ee4: if_percpuq_softint+0x80(0x8ffbde20)
0x94303f20: softint_dispatch+0xaf(0x8fec4100,0xc)
0x94303f68: softint_process+0xa(0)

Further debugging narrowed the problem down to the receive ring buffer. This buffer is implemented as a set of mbufs, allocated when the driver is initialised (in if_ubaminit() called from qtinit()). The mbufs use external buffers from the "mclpl" pool cache. The flags passed to pool_cache_init() to create the "mclpl" cache are such that the pool headers may be placed on the same pages as objects in the pool, and so the offsets within a page of the mbufs' external buffers end up being inconsistent - all aligned to COHERENCY_UNIT, but not necessarily all the same.

if_ubaminit() also initialises page table entries for DMA and the buffer description list for the receive ring buffer; the resulting bus addresses in the buffer description list will also have differing offsets from the beginning of a page.

When a frame is received, a filled buffer is removed from the receive ring buffer, and a new one is substituted, by updating the corresponding page table entries to refer to the new buffer. The entries in the buffer description lists are not changed - the hardware still uses the same bus address, which now might not have the right offset for the new page, resulting in a subsequent frame being written outside the area allocated for it - over the pool header, or into the red zone, or onto an invalid page.
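
To make the mechanism concrete, here is a minimal user-space sketch of the situation as I understand it. It is not driver code: struct map_entry, struct ring_desc and all the function names are hypothetical, and the page size is passed in as a parameter rather than taken from the real hardware. It only models the point that the descriptor keeps the in-page offset of the original buffer while the map entry is repointed at the page of the new one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one DMA map entry: bus page -> memory page. */
struct map_entry {
    uintptr_t page_base;    /* start of the memory page currently mapped */
};

/* Hypothetical ring descriptor: only the in-page offset matters here. */
struct ring_desc {
    size_t page_off;        /* captured once, when the ring was set up */
};

/* Address at which the device starts writing for this descriptor. */
static uintptr_t
dma_target(const struct map_entry *me, const struct ring_desc *rd)
{
    return me->page_base + rd->page_off;
}

/* Substitute a new buffer by updating only the map entry. */
static void
swap_buffer(struct map_entry *me, uintptr_t new_buf, size_t page_size)
{
    /* page_size must be a non-zero power of two for the mask to work */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    me->page_base = new_buf & ~(uintptr_t)(page_size - 1);
}

/*
 * Signed distance from the start of the new buffer to the address the
 * device will write to.  Zero means the frame lands where it should;
 * anything else means it starts before or after the buffer.
 */
static ptrdiff_t
write_skew(const struct map_entry *me, const struct ring_desc *rd,
    uintptr_t new_buf)
{
    return (ptrdiff_t)dma_target(me, rd) - (ptrdiff_t)new_buf;
}
```

If the original buffer sat at offset 0x40 in its page and the replacement sits at offset 0x80, write_skew() reports a non-zero displacement, which in the real driver would correspond to a frame being written over whatever is adjacent to the new buffer.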

The fix was to change the arguments passed to pool_cache_init() to keep the COHERENCY_UNIT alignment but set the PR_NOTOUCH flag to keep the pool headers separate from the objects in the pool.
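
The property the driver appears to depend on is that every buffer it might swap into the ring has the same offset within a page. A check along the following lines could be used while debugging to confirm that; bufs_share_page_offset() is a hypothetical helper written for illustration, not an existing kernel function, and the page size is again a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical debug helper, not part of the NetBSD tree.
 * Returns 1 if all n buffer addresses have the same offset within a
 * page of page_size bytes, 0 otherwise.  An empty or single-element
 * list trivially satisfies the condition.
 */
static int
bufs_share_page_offset(const uintptr_t *bufs, size_t n, size_t page_size)
{
    const uintptr_t mask = (uintptr_t)(page_size - 1);
    size_t i;

    /* page_size must be a non-zero power of two for the mask to work */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    for (i = 1; i < n; i++) {
        if ((bufs[i] & mask) != (bufs[0] & mask))
            return 0;
    }
    return 1;
}
```

Before the change I would expect such a check to fail for some sets of buffers taken from "mclpl"; after it, I would expect it to pass, but that is an expectation based on the behaviour observed so far rather than something I have proven.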

This seems to work - but is this an appropriate solution? Is there a better one?


thanks

kalvis


diff --git a/sys/kern/uipc_mbuf.c b/sys/kern/uipc_mbuf.c
index 836aa621c43b..185dd928b863 100644
--- a/sys/kern/uipc_mbuf.c
+++ b/sys/kern/uipc_mbuf.c
@@ -184,7 +184,7 @@ mbinit(void)
         NULL, IPL_VM, mb_ctor, NULL, NULL);
     KASSERT(mb_cache != NULL);

-    mcl_cache = pool_cache_init(mclbytes, COHERENCY_UNIT, 0, 0, "mclpl",
+    mcl_cache = pool_cache_init(mclbytes, COHERENCY_UNIT, 0,PR_NOTOUCH, "mclpl",
         NULL, IPL_VM, NULL, NULL, NULL);
     KASSERT(mcl_cache != NULL);
