tech-net archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
Re: BNX driver problem when mbuf clusters run out
On Apr 20, 2012, at 9:45 AM, Matt Thomas wrote:
> The patch has been updated a bit to include a missing bus_dmamap_sync
I do believe I have tracked down the problem. I have not added the
bus_dmamap_sync to my patch since Matt already got that one.
-Bev
Here's my description:
Problem 1:
In bnx_rx_intr, there are two places where the producer can get pushed
ahead of the consumer. In the while (sw_cons != hw_cons) loop,
there is a check for a packet whose layer 2 data has been corrupted,
and a check for being able to allocate a new mbuf cluster to be
placed in the receive queue. In both cases, if the test fails,
the mbuf cluster that has just been removed from the receive ring
buffer is put back, and the producer index is pushed forward. However,
the consumer index is not pushed forward, so the next time around,
the mbuf cluster is removed from where the consumer points, but a
new (or recycled) mbuf cluster is put where the producer points. This
over writes the pointer to another mbuf cluster. The mbuf cluster is
lost forever, and the receive buffer ends up filled with NULLs because
the consumer and produce indices stay out of sync. Eventually all
mbuf clusters are lost, and no more communication can happen over any
interface.
To fix this, I changed the action taken in each test in case of failure.
For the botched layer 2 data, I advance the consumer pointer, and then
break out of the loop. For the failure to get a new mbuf cluster,
I removed the call to recycle the mbuf and allow the packet to be
processed rather than end up dropped on the floor. This now prevents
live lock on the interface because incoming packets can still be
processed even if there are no mbuf clusters to be had. And because
the packet is processed, the consumer pointer is pushed forward.
The producer pointer stays put, because no mbuf has been put into
place in the receiver ring buffer.
In addition, I put a KASSERT in bnx_add_buf to check that no mbuf
cluster are being overwritten in the receive ring.
These fixes stopped the lossage of mbuf clusters.
Problem 2:
bnx_get_buf relies on the counter free_rx_bd to determine whether it
should obtain a new mbuf cluster for the receiver ring. bnx_rx_intr
increments this counter before checking if there was an mbuf available
to be removed. Thus, this counter could end up too high, causing
bnx_get_buf to obtain too many mbuf clusters. Once again, we can
leak mbuf clusters. I moved the increment to the place where the
mbuf cluster is actually removed from the receive ring.
Problem 3:
It is possible for the receive ring to completely lose all of its
mbuf clusters in the receive ring. If this happens, the driver is
dependent on bnx_tick to call bnx_get_buf to refill the ring.
bnx_tick did that, but then forgot to tell the chip about it.
So the chip thought it didn't have any buffers to fill, so it waited.
bnx_intr never got an indication that there were new packets to
read because the chip wasn't loading them, and thus, bnx_rx_intr
would never be called. This caused an individual interface to
freeze, but taking the interface down and up would fix the problem
since it would resync the cpu and the chip. Adding calls to
resync the chip after bnx_tick succeeds in getting new mbuf clusters
fixes the problem.
Here's my patch:
diff --git a/netbsd/src/sys/dev/pci/if_bnx.c b/netbsd/src/sys/dev/pci/if_bnx.c
index 711ec34..6fc815d 100644
--- a/netbsd/src/sys/dev/pci/if_bnx.c
+++ b/netbsd/src/sys/dev/pci/if_bnx.c
@@ -3730,6 +3730,7 @@ bnx_add_buf(struct bnx_softc *sc, struct mbuf *m_new, u_in
* last rx_bd entry to that rx_mbuf_ptr and rx_mbuf_map matches)
* and update counter.
*/
+ KASSERT(sc->rx_mbuf_ptr[*chain_prod] == NULL);
sc->rx_mbuf_ptr[*chain_prod] = m_new;
sc->rx_mbuf_map[first_chain_prod] = sc->rx_mbuf_map[*chain_prod];
sc->rx_mbuf_map[*chain_prod] = map;
@@ -4346,7 +4347,6 @@ bnx_rx_intr(struct bnx_softc *sc)
/* Get the used rx_bd. */
rxbd = &sc->rx_bd_chain[RX_PAGE(sw_chain_cons)][RX_IDX(sw_chain_
- sc->free_rx_bd++;
DBRUN(BNX_VERBOSE_RECV, aprint_error("%s(): ", __func__);
bnx_dump_rxbd(sc, sw_chain_cons, rxbd));
@@ -4393,6 +4393,7 @@ bnx_rx_intr(struct bnx_softc *sc)
/* Remove the mbuf from the driver's chain. */
m = sc->rx_mbuf_ptr[sw_chain_cons];
sc->rx_mbuf_ptr[sw_chain_cons] = NULL;
+ sc->free_rx_bd++;
/*
* Frames received on the NetXteme II are prepended
@@ -4445,7 +4446,8 @@ bnx_rx_intr(struct bnx_softc *sc)
panic("%s: Can't reuse RX mbuf!\n",
device_xname(sc->bnx_dev));
}
- continue;
+ sw_cons = NEXT_RX_BD(sw_cons);
+ break;
}
/*
@@ -4459,18 +4461,6 @@ bnx_rx_intr(struct bnx_softc *sc)
DBRUN(BNX_WARN, aprint_debug_dev(sc->bnx_dev,
"Failed to allocate "
"new mbuf, incoming frame dropped!\n"));
-
- ifp->if_ierrors++;
-
- /* Try and reuse the exisitng mbuf. */
- if (bnx_add_buf(sc, m, &sw_prod,
- &sw_chain_prod, &sw_prod_bseq)) {
- DBRUNIF(1, bnx_breakpoint(sc));
- panic("%s: Double mbuf allocation "
- "failure!",
- device_xname(sc->bnx_dev));
- }
- continue;
}
/* Skip over the l2_fhdr when passing the data up
@@ -5615,8 +5605,12 @@ bnx_tick(void *xsc)
prod_bseq = sc->rx_prod_bseq;
chain_prod = RX_CHAIN_IDX(prod);
bnx_get_buf(sc, &prod, &chain_prod, &prod_bseq);
- sc->rx_prod = prod;
- sc->rx_prod_bseq = prod_bseq;
+ if (sc->rx_prod != prod || sc->rx_prod_bseq != prod_bseq) {
+ sc->rx_prod = prod;
+ sc->rx_prod_bseq = prod_bseq;
+ REG_WR16(sc, MB_RX_CID_ADDR + BNX_L2CTX_HOST_BDIDX, prod);
+ REG_WR(sc, MB_RX_CID_ADDR + BNX_L2CTX_HOST_BSEQ, prod_bseq);
+ }
splx(s);
return;
}
Home |
Main Index |
Thread Index |
Old Index