Subject: Re: mp3 on m68k (was headless)
To: John Klos <john@sixgirls.org>
From: Michael R.Zucca <mrz5149@acm.org>
List: port-mac68k
Date: 03/02/2003 18:45:00
On Sunday, March 2, 2003, at 05:11 PM, Riccardo Mottola wrote:
> on 3/2/03 8:12 PM, John Klos at john@sixgirls.org wrote:
> The problem would also remain SCSI access, which on the 840 is still
> dogslow
> and CPU consuming (no dma)
I'm working on that as we speak. I've gotten DMA to work, but I'm not
getting the big improvements in performance that I was expecting. Here
are some bonnie scores from my latest test kernel (1.5-pre 1.6 -current
vintage):
./bonnie -s 10 (This makes a 10 megabyte test file)
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Kernel     MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
PIO         0   533 98.9  2217 98.7  1449 98.3   474 99.3  3315 100.1 211.4 97.6
DMATEST     0   533 98.5  1737 69.7  1394 98.2   475 99.8  3469 99.9 212.9 99.4
So, on block writes we're using less CPU (good) but we're getting less
throughput (bad). On block reads we're using marginally less CPU and
getting only marginally better throughput. (Though, it seems to feel
"subjectively faster" when I run it with DMA :-) )
My theory at the moment is that the throughput isn't better because the
drives are not doing synchronous transfers and this is because when the
drive probe is going on at boot, the generic NCR SCSI code is using
unaligned ECB buffers to do the sync negotiation. Thus, the first few
bytes of the transfer are done with PIO, and this seems to cause the
sync negotiation to fail. If I were able to use DMA for that
transaction, I think the sync negotiation would pass and we'd be seeing
much, much better bonnie scores. I've got a few ideas on how to do
this, but I'm still working on it. Right now I'm trying to use a DMA
bounce buffer to pick up short, unaligned transfers, but I'm having
some problems.
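For readers unfamiliar with the bounce-buffer idea mentioned above, here is a minimal sketch in C. It is not the code from the test kernel. The names (needs_bounce, dma_round_up, bounce_stage_out, bounce_copy_in) and the 64-byte buffer size are assumptions made for illustration, and the sketch does not settle whether the hardware tolerates a DMA length padded beyond the real SCSI transfer length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_GRANULE 16u   /* alignment and length multiple described in the post */
#define BOUNCE_SIZE 64u   /* assumed capacity; must be a multiple of DMA_GRANULE */

/* Hypothetical bounce buffer. A real driver would need DMA-safe memory. */
static _Alignas(16) uint8_t bounce[BOUNCE_SIZE];

/* Nonzero when the address or the length violates the 16-byte rule. */
int needs_bounce(const void *addr, size_t len)
{
    return ((uintptr_t)addr % DMA_GRANULE) != 0 || (len % DMA_GRANULE) != 0;
}

/* Round a length up to the next multiple of the DMA granule. */
size_t dma_round_up(size_t len)
{
    return (len + DMA_GRANULE - 1) / DMA_GRANULE * DMA_GRANULE;
}

/*
 * Stage outgoing bytes in the bounce buffer, zero-padding the tail.
 * Returns the padded length to hand to the DMA engine, or 0 if the
 * data does not fit and the caller should fall back to PIO.
 */
size_t bounce_stage_out(const void *src, size_t len)
{
    size_t padded = dma_round_up(len);

    if (len == 0 || padded > BOUNCE_SIZE)
        return 0;
    memset(bounce, 0, padded);
    memcpy(bounce, src, len);
    return padded;
}

/* After an inbound DMA completes, copy the useful bytes to the caller. */
void bounce_copy_in(void *dst, size_t len)
{
    assert(len <= BOUNCE_SIZE);
    memcpy(dst, bounce, len);
}
```

A 5-byte message at an odd address would be staged with bounce_stage_out(msg, 5), which returns 16 as the length to program into the DMA channel.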
The biggest problem is that the SCSI DMA channel seems to want things
16-byte aligned and it wants transfer lengths to be a multiple of 16
bytes. This makes sense since the SCSI chip FIFO is 16 bytes deep and
the AV documentation lists the SCSI DMA channel buffer as 16 bytes,
but it's still a pain. :-)
As for CPU usage, we have to take an interrupt for each non-physically
contiguous block we transmit. I suspect we're not getting more than a
couple of contiguous blocks in a row, so we're still taking a number of
expensive SCSI interrupts. Right now, each segment requires a new SCSI
transaction. Ideally, I'd like to change the code so that the SCSI
transaction is set up once, and the DMA engine raises an interrupt at the
end of each contiguous block, whose handler then loads the next block from the
dmamap for the transfer. I believe that the DMA engine is capable of
this; I just haven't gotten around to finding the interrupt and setting
up the infrastructure to do this. If it were done this way there would
still be interrupts for each block but they would be much less costly.
Though, I suspect interrupts on the 68k are so costly that I may only
see a few percentage points of improvement in CPU usage.
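The "set up once, reload per segment" scheme described above can be modeled in a few lines of C. This is only a simulation under stated assumptions: struct seg loosely imitates a dmamap segment (address plus length), the dma_addr and dma_len fields stand in for DMA engine registers, and xfer_start and xfer_dma_intr are hypothetical names. No real hardware interrupt or bus_dma call is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One physically contiguous piece of the transfer (assumed layout). */
struct seg {
    uint32_t addr;
    uint32_t len;
};

struct xfer {
    const struct seg *segs;  /* segment list for the whole transfer */
    size_t nsegs;
    size_t next;             /* index of the next segment to load */
    uint32_t dma_addr;       /* stand-in for the DMA address register */
    uint32_t dma_len;        /* stand-in for the DMA count register */
    int scsi_setups;         /* how many SCSI transactions were set up */
};

/* Program the simulated DMA registers with the next segment. */
static void load_next(struct xfer *x)
{
    x->dma_addr = x->segs[x->next].addr;
    x->dma_len = x->segs[x->next].len;
    x->next++;
}

/* Set up the SCSI transaction once and load the first segment. */
void xfer_start(struct xfer *x, const struct seg *segs, size_t nsegs)
{
    assert(nsegs > 0);
    x->segs = segs;
    x->nsegs = nsegs;
    x->next = 0;
    x->scsi_setups = 1;
    load_next(x);
}

/*
 * Simulated end-of-segment DMA interrupt. Returns 1 if another
 * segment was loaded, 0 when the whole transfer is finished.
 * Note that it never touches scsi_setups.
 */
int xfer_dma_intr(struct xfer *x)
{
    if (x->next >= x->nsegs)
        return 0;
    load_next(x);
    return 1;
}
```

The point of the model is that each segment still costs one interrupt, but the handler only rewrites two registers instead of starting a new SCSI transaction.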
I've been contemplating releasing the above code so other folks can
play with it too. If folks are interested enough, I can apply a little
polish to what I've got and put it up for download. Though, be
forewarned, the code is still a bit primitive. I think we could be
doing a lot more with the DMA engine than what I'm doing currently.
Right now, the code uses PIO to 16-byte align transfer addresses and
lengths. All other transactions DMA directly to/from memory. For many
transactions that are page aligned, like paging to/from disk, the
entire transaction is done with DMA!
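To make the PIO/DMA split concrete, the sketch below divides a transfer into an unaligned head (PIO), a 16-byte-aligned body whose length is a multiple of 16 (DMA), and a leftover tail (PIO). The struct and function names are invented for this example and the logic is a guess at the general technique, not a copy of the test kernel's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGN 16u   /* alignment and length multiple from the post */

/* Byte counts for each part of a transfer. */
struct split {
    size_t head;   /* PIO bytes needed to reach a 16-byte boundary */
    size_t body;   /* DMA bytes: aligned start, multiple of 16 */
    size_t tail;   /* PIO bytes left after the last full 16-byte unit */
};

struct split split_transfer(uintptr_t addr, size_t len)
{
    struct split s;
    size_t misalign = addr % DMA_ALIGN;
    size_t rest;

    /* Bytes up to the next boundary, capped at the transfer length. */
    s.head = misalign ? DMA_ALIGN - misalign : 0;
    if (s.head > len)
        s.head = len;

    rest = len - s.head;
    s.tail = rest % DMA_ALIGN;
    s.body = rest - s.tail;

    assert(s.head + s.body + s.tail == len);
    return s;
}
```

For a page-aligned 4096-byte transfer this yields head 0, body 4096, tail 0, which matches the all-DMA paging case described above. A 100-byte transfer starting at an address ending in 0x4 yields head 12, body 80, tail 8.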
Any takers?
--
----------------------------------------------
Michael Zucca - mrz5149@acm.org
----------------------------------------------
"I'm too old to use Emacs." -- Rod MacDonald
----------------------------------------------