Port-i386 archive
x86 bus_dmamap_sync, again
I reviewed the x86 bus_dmamap_sync with reference to the AMD64 memory
ordering table[1], and I think the fences in bus_dmamap_sync are
wrong: in some cases they aren't strong enough, and in others they
are stronger than they need to be.
In particular:
- I think PREREAD needs SFENCE -- because it's not about ordering
prior memory reads with later reads from a DMA buffer, but rather
ordering _all_ prior use of the DMA buffer (including, e.g.,
memset(buf, 0, n) to zero the buffer before it is reused) with a
later _write_ to trigger DMA. It doesn't need LFENCE at all because
reads prior to the write that triggers DMA don't need to be ordered.
- I think neither PREWRITE nor POSTWRITE needs MFENCE -- because
MFENCE is only ever needed for store-before-load ordering, which is
not relevant here; SFENCE suffices for PREWRITE (ensure prior use of
the buffer has completed before the write that triggers DMA) and nop
suffices for POSTWRITE (the read that observes DMA completion is
never delayed until after a subsequent store because x86 never
reorders load;store to store;load, and executing a later read sooner
is safe because the buffer isn't changing anyway).
- The POSTREAD case has two LFENCEs, one before reading into the
bounce buffer and one after -- but the second LFENCE is unnecessary;
the first LFENCE suffices to ensure the caller, or the bouncer,
reads the DMA results and not stale memory out of the buffer. In
the bounce case any use of the user's DMA buffer carries a data
dependency through the bounce buffer so no barrier is needed.
So, in brief, the fences _after bouncing_ (to or from the bounce
buffer, depending on DMA direction), are:
op                      currently   needed
PREREAD                 lfence      sfence
PREWRITE                mfence      sfence
PREREAD|PREWRITE        mfence      sfence
POSTREAD                lfence      nop[*]
POSTWRITE               mfence      nop
POSTREAD|POSTWRITE      mfence      nop[*]
[*] These cases still need to issue LFENCE _before bouncing_.
(SFENCE is only needed if the DMA buffer or the DMA descriptor live in
wc/wc+ memory -- i.e., prefetchable. Often the buffer lives in
regular memory and the DMA is triggered by a write to uc memory for
memory-mapped I/O, for which no fence is needed. But I'm not 100%
sure this applies to all drivers. For instance, i915 command queues
may be mapped write-combining and have pointers to regular buffers,
necessitating SFENCE -- of course, that's moot for the i915 driver
because it's upstream Linux code, but it illustrates the semantic
need.)
I suggest the attached change to address this, and to explain what's
going on with specific reference to the manual. Thoughts?
[1] AMD64 Architecture Programmer's Manual, Volume 2: System
Programming, AMD Publication No. 24593, Revision 3.38, 2021-10,
Sec. 7.4.2 Memory Barrier Interaction with Memory Types, Table
7-3, p. 196.
https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256
From c20ae086125b6526acd9c42987b1d728673e2548 Mon Sep 17 00:00:00 2001
From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Date: Tue, 12 Jul 2022 21:20:27 +0000
Subject: [PATCH] x86: Adjust fences issued in bus_dmamap_sync after bouncing.
And expand the comment on the lfence for POSTREAD before bouncing.
Net change:
op                      before      after
PREREAD                 lfence      sfence
PREWRITE                mfence      sfence
PREREAD|PREWRITE        mfence      sfence
POSTREAD                lfence      nop[*]
POSTWRITE               mfence      nop
POSTREAD|POSTWRITE      mfence      nop[*]
The case of PREREAD is as follows:
1. loads and stores before DMA buffer may be allocated for the purpose
2. bus_dmamap_sync(BUS_DMASYNC_PREREAD)
3. store to register or DMA descriptor to trigger DMA
The register or DMA descriptor may be in any type of memory (or I/O).
lfence at (2) is _not enough_ to ensure stores at (1) have completed
before the store in (3) in case the register or DMA descriptor lives
in wc/wc+ memory, or the store to it is non-temporal: in that case,
it may be executed early, before all the stores in (1) have completed.
On the other hand, lfence at (2) is _not needed_ to ensure loads in
(1) have completed before the store in (3), because x86 never
reorders load;store to store;load. So we may need to enforce
store/store ordering, but not any other ordering, hence sfence.
The case of PREWRITE is as follows:
1. stores to DMA buffer (and loads from it, before it was allocated)
2. bus_dmamap_sync(BUS_DMASYNC_PREWRITE)
3. store to register or DMA descriptor to trigger DMA
Ensuring prior loads have completed is not necessary because x86
never reorders load;store to store;load (and in any case, the device
isn't changing the DMA buffer, so it's safe to read over and over
again). But we must ensure the stores in (1) have completed before
the store in (3). So we need sfence, in case either the DMA buffer
or the register or the DMA descriptor is in wc/wc+ memory or either
store is non-temporal. But we don't need mfence.
The case of POSTREAD is as follows:
1. load from register or DMA descriptor notifying DMA completion
2. bus_dmamap_sync(BUS_DMASYNC_POSTREAD)
(a) lfence [*]
(b) if bouncing, memcpy(userbuf, bouncebuf, ...)
(c) ???
3. loads from DMA buffer to use data, and stores to reuse buffer
This certainly needs an lfence to prevent the loads at (3) from being
executed early -- but bus_dmamap_sync already issues lfence in that
case at 2(a), before it conditionally loads from the bounce buffer
into the user's buffer. So we don't need any _additional_ fence
_after_ bouncing at 2(c).
The case of POSTWRITE is as follows:
1. load from register or DMA descriptor notifying DMA completion
2. bus_dmamap_sync(BUS_DMASYNC_POSTWRITE)
3. loads and stores to reuse buffer
Stores at (3) will never be executed early because x86 never reorders
load;store to store;load for any memory types. Loads at (3) are
harmless because the device isn't changing the buffer -- it's
supposed to be fixed from the time of PREWRITE to the time of
POSTWRITE as far as the CPU can witness.
Reference:
AMD64 Architecture Programmer's Manual, Volume 2: System Programming,
24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory Barrier Interaction
with Memory Types, Table 7-3, p. 196.
https://www.amd.com/system/files/TechDocs/24593.pdf
---
sys/arch/x86/x86/bus_dma.c | 77 ++++++++++++++++++++++++++++----------
1 file changed, 58 insertions(+), 19 deletions(-)
diff --git a/sys/arch/x86/x86/bus_dma.c b/sys/arch/x86/x86/bus_dma.c
index ec26be4f61f6..9d39560647b9 100644
--- a/sys/arch/x86/x86/bus_dma.c
+++ b/sys/arch/x86/x86/bus_dma.c
@@ -789,6 +789,13 @@ _bus_dmamap_unload(bus_dma_tag_t t, bus_dmamap_t map)
/*
* Synchronize a DMA map.
+ *
+ * Reference:
+ *
+ * AMD64 Architecture Programmer's Manual, Volume 2: System
+ * Programming, 24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory
+ * Barrier Interaction with Memory Types, Table 7-3, p. 196.
+ * https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256
*/
static void
_bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map, bus_addr_t offset,
@@ -816,11 +823,17 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map, bus_addr_t offset,
#endif
/*
- * The caller has been alerted to DMA completion by reading a
- * register or DMA descriptor, and is about to read out of the
- * DMA memory buffer that the device filled. LFENCE ensures
- * that these happen in order, so that the caller doesn't
- * proceed to read any stale data from cache or speculation.
+ * BUS_DMASYNC_POSTREAD: The caller has been alerted to DMA
+ * completion by reading a register or DMA descriptor, and the
+ * caller is about to read out of the DMA memory buffer that
+ * the device just filled.
+ *
+ * => LFENCE ensures that these happen in order so that the
+ * caller, or the bounce buffer logic here, doesn't proceed
+ * to read any stale data from cache or speculation. x86
+ * never reorders loads from wp/wt/wb or uc memory, but it
+ * may execute loads from wc/wc+ memory early, e.g. with
+ * BUS_SPACE_MAP_PREFETCHABLE.
*/
if (ops & BUS_DMASYNC_POSTREAD)
x86_lfence();
@@ -954,20 +967,46 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map, bus_addr_t offset,
break;
}
end:
- if (ops & (BUS_DMASYNC_PREWRITE|BUS_DMASYNC_POSTWRITE)) {
- /*
- * from the memory POV a load can be reordered before a store
- * (a load can fetch data from the write buffers, before
- * data hits the cache or memory), a mfence avoids it.
- */
- x86_mfence();
- } else if (ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_POSTREAD)) {
- /*
- * all past reads should have completed at before this point,
- * and future reads should not have started yet.
- */
- x86_lfence();
- }
+ /*
+ * BUS_DMASYNC_PREREAD: The caller may have previously been
+ * using a DMA memory buffer, with loads and stores, and is
+ * about to trigger DMA by writing to a register or DMA
+ * descriptor.
+ *
+ * => SFENCE ensures that the stores happen in order, in case
+ * the latter one is non-temporal or to wc/wc+ memory and
+ * thus may be executed early. x86 never reorders
+ * load;store to store;load for any memory type, so no
+ * barrier is needed for prior loads.
+ *
+ * BUS_DMASYNC_PREWRITE: The caller has just written to a DMA
+ * memory buffer, or we just wrote to the bounce buffer,
+ * data that the device needs to use, and the caller is about
+ * to trigger DMA by writing to a register or DMA descriptor.
+ *
+ * => SFENCE ensures that these happen in order so that any
+ * buffered stores are visible to the device before the DMA
+ * is triggered. x86 never reorders (non-temporal) stores
+ * to wp/wt/wb or uc memory, but it may reorder two stores
+ * if one is to wc/wc+ memory, e.g. if the DMA descriptor is
+ * mapped with BUS_SPACE_MAP_PREFETCHABLE.
+ */
+ if (ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE))
+ x86_sfence();
+
+ /*
+ * BUS_DMASYNC_POSTWRITE: The caller has been alerted to DMA
+ * completion by reading a register or DMA descriptor, and the
+ * caller may proceed to reuse the DMA memory buffer, with
+ * loads and stores.
+ *
+ * => No barrier is needed. Since the DMA memory buffer is not
+ * changing (we're sending data to the device, not receiving
+ * data from the device), prefetched loads are safe. x86
+ * never reorders load;store to store;load for any memory
+ * type, so early execution of stores prior to witnessing
+ * the DMA completion is not possible.
+ */
}
/*