Subject: Re: kern/18636: Multiple uvm_pagefaults
To: Don Phillips <don@resun.com>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 10/13/2002 20:48:32
On Sun, Oct 13, 2002 at 11:31:46AM -0700, Don Phillips wrote:
> >>>>> "Manuel" == Manuel Bouyer <bouyer@antioche.eu.org> writes:
> 
> [...]
> 
> Manuel> Well, this really looks like a hardware problem. 
> 
> 'Twas my initial thought, however:
> 
> Memtest86 was run in extended mode.

Each time I tried Memtest86, it didn't find problems in memory modules
which had real problems.

> 
> I spent a month tracking the problem down to a SW subsystem.  I
> replaced all of the memory.  I replaced the MB and processor.  I
> reproduced it in both environments.  MBs were from two different
> manufacturers.
> 
> Yep.  I, too, thought it was HW.  'I are a SW engineer.'  I'd say
> that unless we've managed to find a flaw in two different
> motherboards, running two different processors (AMD K-6, Athlon
> 1.53GHz), utilizing new memory modules, with a new HD, it would seem
> to pretty solidly point to something in SW, and since the kernel
> crashes, I'd say that the kernel, at a minimum, owns a piece of the
> problem.
> 
> The only HW in common between the two systems were network cards.
> Yesterday, on 1.6, I made a kernel with all network cards and the
> MII/PHYs disabled (but left the cards in place).

You should also pull out the network card.
Did you also change power supplies?

> 
> Manuel> I have various i386 systems running 1.6, all of them have
> Manuel> been stable.
> 
> Wouldn't surprise me.  I sincerely believe that the release was well
> tested before its release.  :-)
> 
> Manuel> I've even been pushing a system hard this week-end (to test
> Manuel> a machine rebuilt from various pieces of hardware) running
> Manuel> make -j20 or make -j40 (depending on the amount of RAM)
> Manuel> kernel builds, and a build.sh -j10.  I've tested various RAM
> Manuel> configs from 32M to 128M.
> 
> Ah!  Now, we've got a difference that may lead to something.  The
> configs that are breaking are 384MB (128MB+256MB) and 512MB.  And
> the 1.5.2 failures, I believe, are after I upgraded from 256M to
> 384M.
> 
> And it would explain why I'm seeing the problem, but nobody else.  I
> have *lots* of memory.

That's not lots these days. cvs.netbsd.org is running 1.5.3 with 384MB
(and I believe has run previous releases with that amount of memory too).
ftp.netbsd.org is running 1.6 with 3.5GB. For sure there are lots of other
machines with more than 256MB running NetBSD out there.

> 
> So, maybe a boundary condition, somewhere in the uvm system, for
> large memory systems?  It doesn't always break.  1.5.2 is stable,
> unless I'm running the SW system that uses DBs with tables of
> 1.2GB.

You should try to upgrade to 1.5.3, at least.
The database use is probably special and could point out bugs.
But core dumps during kernel compiles are different: compiling kernels is
something a lot of people do.

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
--