Re: Strange hang on 4.99.49

To: current-users%netbsd.org@localhost
Subject: Re: Strange hang on 4.99.49
From: Marc Tooley <netbsdMLpostNO%spam.quake.ca@localhost>
Date: Fri, 25 Jan 2008 09:33:28 -0800

On Tuesday 22 January 2008, Paul Goyette wrote:
> On Wed, 23 Jan 2008, Zafer Aydogan wrote:
> > 2008/1/22, Paul Goyette <paul%whooppee.com@localhost>:
> >> I'm running a kernel + userland from sources as of just a few
> >> hours ago.
> >>
> >> I started up a build.sh with -j 4 and all of a sudden, after an
> >> hour or more of running, the system just went completely idle even
> >> though the build is nowhere near complete.
> >>
> >> Looking around, I found three cc1 processes (only 3, not 4).  One
> >> of them was in a 'vnode' state while the other two are in
> >> 'vnlock'.  Both my /usr/src and /usr/obj are null-mounts:
> >>
> >>        /dev/wd0a on / type ffs (NFS exported, local)
> >>        /dev/wd0e on /var type ffs (local)
> >>        /dev/wd0g on /home type ffs (NFS exported, local)
> >>     => /dev/wd0h on /build type ffs (soft dependencies, NFS exported, local)
> >>        /dev/wd1a on /amanda type ffs (local)
> >>     => /build/src on /usr/src type null (local)
> >>     => /build/obj on /usr/obj type null (local)
> >>        /build/xsrc on /usr/xsrc type null (local)
> >>        /build/pkgsrc on /usr/pkgsrc type null (local)
> >>        kernfs on /kern type kernfs (local)
> >>        ptyfs on /dev/pts type ptyfs (local)
> >>        tmpfs on /tmp type tmpfs (local)
> >>
> >> Is there some sort of locking race condition that I've managed to
> >> hit? If so, is there a way to avoid it?
> >
> > I'm having the same problems with LFS.
> > Processes get stuck in vnode, especially when read/write is
> > involved. FFS seems not to be affected.
>
> Well, since I couldn't kill any of the processes, I tried to shut the
> system down.  It hung for several hours at "dismounting file systems"
> and I finally had to hit the reset button.
>
> If there's anything I can do to help debug this, I'll be happy to
> help. In the meantime I'm going to try again to see if I can get a
> build.sh running - this time I think I'll try without -j4 and see if
> it gets any further.

I had an almost identical problem on my system. System specs:

Intel(R) Pentium(R) D CPU 3.00GHz (dual-core, EM64T)
Intel D945GNT motherboard
total memory = 3062 MB

... in my case, a long time ago the AMD64 port basically refused to run 
and would crash constantly (port-amd64/34122), so I switched to running 
in 32-bit i386 mode. From about 3.99.23 until around 4.99.15, i386 was 
rock-solid on that machine. The 3.99.23 kernel in particular is so solid 
that another machine running it has been up for well over a year:

:?:09:56:53 ~# uptime
 9:56AM  up 381 days,  3:26, 20 users, load averages: 0.21, 0.30, 0.28

... and that machine runs multiple MySQL databases, ntpd and Apache, 
handles thousands of emails in and out per day, and acts as an OpenVPN 
hub for about six machines. It is also an active NFS server, a 
Squid-based anonymizing web proxy, a BIND9 name server, a busy Samba 
server for about five Windows machines, and a CVS pserver for NetBSD, 
DragonFlyBSD, MythTV and a half dozen others. On top of that, it runs 
about six Perforce daemons tracking hundreds of thousands of files from 
many open source applications, along with local branches full of 
customizations and modifications.

Then, recently, I tried to update the other machine (the Pentium D 
again) to -current, around 4.99.45, and it began getting strange 
freezes. Existing processes kept going, but new processes would fail 
with complaints about too many processes in the system, even though only 
about four were running. It was a lot like the livelock problems 
reported recently on -sparc64.

Running build.sh on it would trigger the problem within a dozen minutes, 
and the machine would then need to be reset. Andrew Doran suspected 
softdeps, which I was indeed running, but the problems continued after 
I turned softdeps off.

http://mail-index.netbsd.org/current-users/2008/01/11/0018.html
http://mail-index.netbsd.org/current-users/2008/01/11/0020.html

I've now switched the system to the AMD64 port (4.99.49, updated a few 
days ago). The original installation problems are fixed, but building 
anything on it produces all sorts of nasty errors.

I get these errors in dmesg:

free inode /v/3430621 had 4 blocks
free inode /v/3430622 had 4 blocks
free inode /v/3430623 had 4 blocks
free inode /v/3430624 had 4 blocks

And when building various versions of NetBSD:

/v/netbsd-build/netbsd-3.1_STABLE/i386/TOOLS/bin/i386--netbsdelf-strip -g -o 
netbsd netbsd.gdb
echo  netbsd | 
GZIP=-9 /v/netbsd-build/netbsd-3.1_STABLE/i386/TOOLS/bin/nbpax -O -zw -M -N 
/v/src-3-build/etc -f 
/v/netbsd-build/netbsd-3.1_STABLE/i386/REL/i386/binary/sets/kern-GENERIC.tgz
[1]   Segmentation fault (core dumped) GZIP=-9 /v/netbs...

And when building various packages from pkgsrc inside a pkg_comp chroot:

Processing hints file hints/netbsd.pl
Unable to find a perl 5 (by these names: ../../miniperl miniperl perl 
perl5 perl5.8.8, in these 
dirs: ../.. /pkg_comp/obj/pkgsrc/lang/perl5/default/.wrapper/bin 
/pkg_comp/obj/pkgsrc/lang/perl5/default/.buildlink/bin 
/pkg_comp/obj/pkgsrc/lang/perl5/default/.tools/bin 
/pkg_comp/obj/pkgsrc/lang/perl5/default/.gcc/bin /usr/pkg/bin /sbin /usr/sbin 
/bin /usr/bin /usr/pkg/sbin /usr/pkg/bin /usr/X11R6/bin /usr/local/sbin 
/usr/local/bin /usr/pkg/bin /usr/X11R6/bin /usr/pkg/bin)
Writing Makefile for DynaLoader
sh: /pkg_comp/obj/pkgsrc/lang/perl5/default/perl-5.8.8/ext/DynaLoader/0: 
not found
*** Error code 127

... while others are filled with:

/etc/ld.so.conf: invalid/unknown sysctl for libm.so.0 (22)
/etc/ld.so.conf: invalid/unknown sysctl for libm.so.0 (22)
/etc/ld.so.conf: invalid/unknown sysctl for libm.so.0 (22)

Anyway, off to try an update. :)

-Marc
