Subject: Re: generic HBA error messages on 1.6beta2
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Matthew Jacob <mjacob@feral.com>
List: port-alpha
Date: 07/10/2002 15:58:58
Jason- read the thread. The patches I offered for him to try already have
this.
On Wed, 10 Jul 2002, Jason R Thorpe wrote:
> On Wed, Jul 10, 2002 at 02:21:42AM +0200, Matthias Buelow wrote:
>
> > 1) the problem only appears to occur with machines with >= 1GB RAM
> > installed (as Mel Kravitz claims, who has seen the same problem),
>
> Yes, >= 1G causes the SGMAP code to be used.
>
> > 2) the problem only occurs here when the machine has been running for
> > at least 2-3 days, this might hint at some problem with higher
> > address spaces or physical memory or mappings, and the kernel
> > migrates some mappings or buffers slowly upwards over time,
> > making the problem appear after a couple of days,
>
> Hm. Well, pages are actually cycled through pretty quickly. My guess
> would instead be some kind of slow resource leak.
>
> > 3) the problem appears to be with the dma mapping of the host adapter,
> > or more generally; considering that Jason has made new SGMAP DMA
> > improvements a while ago (according to the /alpha webpage) this
> > might be a hint that something might be broken there (with the
> > direct-mapped DMA window, although it only mentions mbufs and
> > things being made "a bit more efficient" on the webpage),
>
> The improvements in question fixed some bugs, and also reduced resource
> usage on disk->memory transfers. Matt Thomas and I also recently fixed
> a serious SGMAP resource-leaking bug.
>
> Are you, by chance, running kernels built with "options DIAGNOSTIC"?
>
> > I haven't checked yet if the problem also occurs on the adaptec
> > controller (or at least, never have seen it for that one so far)
> > which is also installed in the system, which may or may not hint
> > at specific problems with the isp (qlogic) driver. I somehow doubt
> > that, though, but I of course can't tell.
>
> Here is what I would suggest:
>
> In isp_pci_dmasetup(), in the error case for bus_dmamap_load(), print
> out the errno. EAGAIN and ENOMEM will be common ... those can occur
> as transient errors due to temporary resource shortage ... the scsipi
> layer backs off in that case, and retries the command.
>
> --
> -- Jason R. Thorpe <thorpej@wasabisystems.com>
>
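
For illustration only, here is a minimal userland sketch of the check
suggested above: report the errno from a failed DMA map load and note
whether it is one of the two values described as transient (EAGAIN,
ENOMEM). Both helper names are hypothetical and are not part of the isp
driver or of bus_dma(9). The real change would be a print in the
bus_dmamap_load() error path of isp_pci_dmasetup().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the isp driver): true for the errno
 * values the message above describes as transient resource shortages.
 */
static bool
dma_load_error_is_transient(int error)
{
	return error == EAGAIN || error == ENOMEM;
}

/*
 * Hypothetical helper: format the kind of diagnostic one might print in
 * the bus_dmamap_load() error path. "dev" is an assumed device name.
 * Returns whatever snprintf() returns.
 */
static int
dma_load_error_describe(char *buf, size_t len, const char *dev, int error)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
	    "%s: bus_dmamap_load failed, errno %d (%s)%s",
	    dev, error, strerror(error),
	    dma_load_error_is_transient(error) ? ", transient" : "");
}
```

Any errno other than those two would be the interesting one to report
back to the list, since the scsipi layer only backs off and retries in
the transient case.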