Current-Users archive


Re: Acer M5229 IDE bugs (esp. on sparc64)



Rafal Boni wrote:
Folks:
        The discussion re: Acer M5229 IDE controllers in the V100 (my
        main home file / mail / shell server) and a link I'd saved from
        long ago in my bookmarks which I happened to trip over today
        made me dig into the issues I've been seeing on my SunFire V100,
        namely a pretty steady stream of:

wd1d: DMA error reading fsbn 160621568 of 160621568-160621631 (wd1 bn 177397712;
 cn 175989 tn 12 sn 44), retrying
wd1: soft error (corrected)

        messages in /var/log/messages.  The disks seem to work just fine
        for some reason, but it does make me worry.

        Here's the IDE controller / disk related bits of my boot messages:

aceride0 at pci0 dev 13 function 0
aceride0: Acer Labs M5229 UDMA IDE Controller (rev. 0xc3)
aceride0: bus-master DMA support present
aceride0: primary channel configured to native-PCI mode
aceride0: using ivec 180c for native-PCI interrupt
atabus0 at aceride0 channel 0
aceride0: secondary channel configured to native-PCI mode
atabus1 at aceride0 channel 1
wd0 at atabus0 drive 0: <ST3120026A>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(aceride0:0:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA)
atapibus0 at atabus1: 2 targets
cd0 at atapibus0 drive 1: <CD-224E, , P.9A> cdrom removable
cd0: drive supports PIO mode 4, DMA mode 2
wd1 at atabus1 drive 0: <ST3120026A>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(aceride0:1:0): using PIO mode 4, Ultra-DMA mode 4 (Ultra/66) (using DMA)
cd0(aceride0:1:1): using PIO mode 4, DMA mode 2 (using DMA)

        As I said, I tripped over a bookmark I'd made to a FreeBSD commit
        ([1]) that supposedly fixed this. It claims that the firmware
        incorrectly sets the device to "use the ATA66 byte counter instead
        of triggering an interrupt at the zero count of the transfer buffer
        counter" (see bug report and analysis in [2]).

Some more testing says that neither this patch nor any additional rearranging of the aceride register writes (per the Linux/FreeBSD drivers) makes any difference.

What is interesting is that if I add the following bit of debug to ata/ata_wdc.c:

Index: ata/ata_wdc.c
===================================================================
RCS file: /cvsroot/src/sys/dev/ata/ata_wdc.c,v
retrieving revision 1.87
diff -u -p -b -u -p -r1.87 ata_wdc.c
--- ata/ata_wdc.c       19 Oct 2007 11:59:36 -0000      1.87
+++ ata/ata_wdc.c       14 Feb 2008 18:57:03 -0000
@@ -687,6 +687,9 @@ wdc_ata_bio_intr(struct ata_channel *chp
                }
                if (wdc->dma_status != 0) {
                        if (drv_err != WDC_ATA_ERR) {
+                               printf("%s:%d:%d: DMA error (st=0x%x, er=0x%x)\n",
+                                   atac->atac_dev.dv_xname, chp->ch_channel,
+                                   xfer->c_drive, wdc->dma_status, ata_bio->r_error);
                                ata_bio->error = ERR_DMA;
                                drv_err = WDC_ATA_ERR;
                        }

I see that wdc->dma_status is always 0x04 (WDC_DMAST_UNDER), which is a synthetic error generated only by pciide_dma_finish(). I'm guessing that the suspect pciide_dma_finish() call is the one made from wdcintr(). Because the rev of the M5229 IDE controller I have doesn't have a chan-id register to determine which channel caused an interrupt, for this chip we end up *always* checking both channels, and the code in wdcintr() / pciide_dma_finish() looks very suspicious: stop DMA first, ask questions later.
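
To make the suspicion above concrete, here is a minimal standalone C model of a "stop first, ask questions later" finish routine being run on both channels of a controller that cannot say which channel interrupted. It is a sketch, not the NetBSD driver code: the struct, the function names and the MODEL_* constants are all hypothetical, the status bit layout is an assumption, and only the 0x04 value for the "under" flag is taken from the debug output described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated bus-master status bits (assumed layout, not copied from
 * the NetBSD headers). */
#define MODEL_CTL_ACT   0x01	/* DMA engine still active */
#define MODEL_CTL_ERR   0x02	/* DMA engine reported an error */
#define MODEL_CTL_INTR  0x04	/* device raised its interrupt */

/* Result flags. Only UNDER == 0x04 comes from the post; the other
 * two values are assumptions made for this model. */
#define MODEL_DMAST_NOIRQ 0x01
#define MODEL_DMAST_ERR   0x02
#define MODEL_DMAST_UNDER 0x04

/* Hypothetical stand-in for one IDE channel's DMA engine. */
struct model_channel {
	uint8_t status;		/* simulated status register */
	bool    started;	/* simulated "start" bit in the command register */
};

/* Stop-first ordering: snapshot the status, halt the engine no matter
 * what, and only then work out what the snapshot meant. */
static int
model_finish_stop_first(struct model_channel *ch)
{
	uint8_t status = ch->status;
	int error = 0;

	ch->started = false;			/* transfer is killed here */
	ch->status &= (uint8_t)~MODEL_CTL_ACT;

	if ((status & MODEL_CTL_ERR) != 0)
		error |= MODEL_DMAST_ERR;
	if ((status & MODEL_CTL_INTR) == 0)
		error |= MODEL_DMAST_NOIRQ;
	if ((status & MODEL_CTL_ACT) != 0)
		error |= MODEL_DMAST_UNDER;	/* engine was still running */
	return error;
}

/* With no channel-id register, every interrupt polls both channels. */
static void
model_shared_intr(struct model_channel ch[2], int result[2])
{
	for (int i = 0; i < 2; i++)
		result[i] = model_finish_stop_first(&ch[i]);
}
```

In this model, a channel that still shows the active bit when it is polled gets its transfer halted and is reported with the "under" flag, which is at least consistent with a steady stream of retried but ultimately successful transfers.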

FreeBSD looks like it does the checks a bit differently. Their whole handling of the Bus-Master DMA status register (IDEDMA_CTL) is a bit different, but the key is that they appear to check the active status (IDEDMA_CTL_ACT) before attempting to kill the DMA, vs. our kill first, check whether it was active later.
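
The "check active before killing" ordering can be sketched the same way. This is again a standalone model under stated assumptions, not FreeBSD's or NetBSD's code: the struct, the function name and the MODEL_* constants are hypothetical, and the status bit layout is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated bus-master status bits (assumed layout). */
#define MODEL_CTL_ACT   0x01	/* DMA engine still active */
#define MODEL_CTL_ERR   0x02	/* DMA engine reported an error */
#define MODEL_CTL_INTR  0x04	/* device raised its interrupt */

#define MODEL_DMAST_ERR 0x02	/* assumed value for this model */

/* Hypothetical stand-in for one IDE channel's DMA engine. */
struct model_channel {
	uint8_t status;		/* simulated status register */
	bool    started;	/* simulated "start" bit in the command register */
};

/* Check-first ordering: return false and leave the engine alone unless
 * the channel is idle and has an interrupt pending. On true, the engine
 * has been stopped and *error holds the result flags. */
static bool
model_finish_check_first(struct model_channel *ch, int *error)
{
	uint8_t status = ch->status;

	if ((status & MODEL_CTL_ACT) != 0)
		return false;	/* still running: do not touch it */
	if ((status & MODEL_CTL_INTR) == 0)
		return false;	/* nothing pending on this channel */

	ch->started = false;	/* safe to stop: the engine is idle */
	*error = (status & MODEL_CTL_ERR) != 0 ? MODEL_DMAST_ERR : 0;
	return true;
}
```

The trade-off in this sketch is that a transfer which really did stall with the engine still active is no longer caught in the interrupt path, so a real driver would presumably have to rely on its timeout handling for that case.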

--rafal

PS: There's still the issue of LBA48 UDMA non-support in the aceride hardware revs <= 0xc4. That's not bothering me, so I'll probably just turn it into a PR.


