Subject: Re: 3ware card troubles?
To: Andrew Doran <ad@interlude.eu.org>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: tech-kern
Date: 04/05/2003 15:58:25
Hello. OK. After a lot of instrumentation, I believe I've solved the
problem. It turns out that there is a queue management issue between
the ld.c driver and the underlying RAID drivers. The ldstrategy() routine
checks to see if the outstanding request queue is full. If it is, it doesn't
issue any more requests to the underlying RAID controller. However, it looks like
the ldstart() and lddone() routines assume that the underlying RAID routines
have executed only one request between the calls to ldstart and lddone.
Both the twe.c and amr.c drivers, however, try to execute all
available requests in the queue whenever they're called. As a result,
the counter which ldstart and lddone increment and decrement respectively
becomes inaccurate, reflecting more requests in the queue than are actually
there. Eventually, when the queue counter indicates the queue is full, the
ld driver waits for the queue to drain, but since no one is calling ldstart
anymore, the queue never drains. As long as the load is light and requests
are completed as they come in, everything works fine. But once several
requests are outstanding at the same time, the miscount eventually brings
things to a halt.
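To make the accounting mismatch concrete, here is a toy Python model, not driver code. CounterFrontEnd, MAX_OUTSTANDING, strategy(), start(), backend_run() and done() are hypothetical stand-ins for ldstrategy(), ldstart(), the controller driver and lddone(). The model assumes the mismatch takes the form of one counter decrement per back-end pass even when that pass completed several requests, which is one simple reading of the description above. It only shows the counter drifting above the real number of outstanding requests; it does not reproduce the full hang.

```python
# Hypothetical toy model, not code from ld.c, twe.c or amr.c.
MAX_OUTSTANDING = 4  # assumed queue depth


class CounterFrontEnd:
    def __init__(self):
        self.queue_cnt = 0   # what the front end believes is outstanding
        self.in_flight = []  # what the back end actually holds
        self.pending = []    # requests not yet handed down

    def strategy(self, req):
        self.pending.append(req)
        self.start()

    def start(self):
        # Issue requests while the *counter* says there is room.
        while self.pending and self.queue_cnt < MAX_OUTSTANDING:
            self.in_flight.append(self.pending.pop(0))
            self.queue_cnt += 1

    def backend_run(self):
        # The back end finishes everything it holds in one pass...
        finished = len(self.in_flight)
        self.in_flight.clear()
        if finished:
            self.done()  # ...but the front end is notified only once.

    def done(self):
        self.queue_cnt -= 1  # assumes exactly one request completed
        self.start()


fe = CounterFrontEnd()

# Light load: each request completes before the next one arrives.
for i in range(8):
    fe.strategy(i)
    fe.backend_run()
print(fe.queue_cnt, len(fe.in_flight))  # 0 0

# Burst: three requests arrive before the back end runs.
for i in range(8, 11):
    fe.strategy(i)
fe.backend_run()
print(fe.queue_cnt, len(fe.in_flight))  # 2 0

# Another burst: the counter now reports a full queue while only two
# requests are really outstanding, so request 13 is held back.
for i in range(11, 14):
    fe.strategy(i)
print(fe.queue_cnt, len(fe.in_flight), fe.pending)  # 4 2 [13]
```

Under light load the counter stays accurate; each burst leaves it higher than the number of requests the controller really holds.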
So the question is: should we change the underlying drivers to deal
with only one request at a time, or modify the ld driver to manage the
request queue differently? My inclination is to fix the ld
driver, but I'm open to suggestions and/or corrections to my findings. As
a reference, the sd driver looks like it relies on sddone() to determine
whether sdstart needs to be called again by looking at actual buffers rather
than queue counters. This seems like an inherently better approach.
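For comparison, here is the buffer-based idea in the same toy style. BufferFrontEnd and its methods are hypothetical and only mirror the approach attributed to sddone()/sdstart above. The front end derives the room left in the queue from the set of buffers it has actually handed down, and the completion path restarts whenever buffers are waiting. It is not the sd driver's code. Batched completion by the back end is assumed to report which buffers finished.

```python
# Hypothetical toy model: room in the queue comes from real buffers,
# not from a separately maintained counter.
MAX_OUTSTANDING = 4  # assumed queue depth


class BufferFrontEnd:
    def __init__(self):
        self.pending = []       # requests not yet handed down
        self.in_flight = set()  # requests the back end actually holds

    def strategy(self, req):
        self.pending.append(req)
        self.start()

    def start(self):
        while self.pending and len(self.in_flight) < MAX_OUTSTANDING:
            self.in_flight.add(self.pending.pop(0))

    def backend_run(self):
        # The back end may still finish everything in one pass; it just
        # reports *which* buffers completed.
        finished = list(self.in_flight)
        if finished:
            self.done(finished)

    def done(self, finished):
        for req in finished:
            self.in_flight.discard(req)
        # Restart based on buffers actually waiting, not on a counter.
        if self.pending:
            self.start()


fe = BufferFrontEnd()
for i in range(10):  # a burst larger than the queue depth
    fe.strategy(i)

runs = 0
while fe.in_flight or fe.pending:
    fe.backend_run()
    runs += 1
print(runs)  # 3
```

Because the outstanding count is just len(in_flight), a batched completion cannot leave it out of step with reality, and the burst of ten drains in three back-end passes.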
I'll think on this and try some patches. If others want to offer
suggestions, I'm all ears.
-Brian