Subject: Re: kern/28594: 2.0_RC5 lock-up/loop in checkaliases()
To: None <andreas@planix.com>
From: Andreas Wrede <andreas@planix.com>
List: netbsd-bugs
Date: 01/02/2005 18:03:24
On 9-Dec-04, at 9:30 AM, andreas@planix.com wrote:
>> Number: 28594
>> Category: kern
>> Synopsis: 2.0_RC5 lock-up/loop in checkalias()
>> Confidential: no
>> Severity: critical
>> Priority: high
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Thu Dec 09 14:30:00 +0000 2004
>> Originator: Andreas Wrede <andreas@planix.com>
>> Release: NetBSD 2.0_RC5
>> Organization:
> Planix, Inc.
>> Environment:
>
>
> System: NetBSD whome.planix.com 2.0_RC5 NetBSD 2.0_RC5 (PLANIX) #11:
> Tue Nov 23 10:49:49 EST 2004
> root@willy.wrede.pvt:/u1/netbsd-2.0/obj/sys/arch/i386/compile.i386/
> PLANIX i386
> Architecture: i386
> Machine: i386
>> Description:
> Every second or third night, during one of the find runs in
> /etc/{daily|security}, the NetBSD 2.0/i386 server locks up. Keystrokes
> no longer echo on the serial console. Entering the debugger usually
> works. When trying to "reboot", I get a "panic: lockmgr: locking
> against myself". At the time of the lock-up, one of the XServe RAID
> based 1TByte file systems was mounted:
>
> df -h
> Filesystem Size Used Avail Capacity Mounted on
> /dev/raid1a 1.4G 1.1G 232M 83% /
> /dev/raid1e 2.0G 1.4G 479M 75% /var
> /dev/raid1f 3.9G 2.4M 3.7G 0% /u1
> /dev/sd0a 1.0T 326G 632G 34% /u5
> kernfs 1.0K 1.0K 0B 100% /kern
> procfs 4.0K 4.0K 0B 100% /proc
>
> /u5 is an ffsv1 filesystem 32 blocks short of the 1TB mark. The same
> lock-up occurs when the /u5 filesystem is a 1TB+ ffsv2 filesystem.
>
> Stopped in pid 25005.1 (find) at netbsd:cpu_Debugger+0x4: leave
> db> bt
> cpu_Debugger(cc0e4b8c,c037ddb0,cc0e4b74,7ff,c1557000) at netbsd:cpu_Debugger+0x4
> comintr(c12b8200,0,cb7d0010,30,cc0e0010) at netbsd:comintr+0x6b9
> Xintr_legacy4() at netbsd:Xintr_legacy4+0xa4
> --- interrupt ---
> checkalias(cfdf8688,120c,c12e6000,cfdfa160,c1908000) at netbsd:checkalias+0x5e
> ufs_vinit(c12e6000,c128c300,c128c200,cc0e4ca8,c23528c0) at netbsd:ufs_vinit+0x69
> ffs_vget(c12e6000,3978196,cc0e4d64,d595eb70,cc0e4cf8) at netbsd:ffs_vget+0x274
> ufs_lookup(cc0e4d94,cfdf8540,cc0e4dac,c037d409,c05730a0) at netbsd:ufs_lookup+0x6d4
> VOP_LOOKUP(cf3ed444,cc0e4e84,cc0e4e98,cc0e4e84,c0573820) at netbsd:VOP_LOOKUP+0x2e
> lookup(cc0e4e74,cbfa6c00,400,cc0e4e8c,cc0e4e24) at netbsd:lookup+0x201
> namei(cc0e4e74,8081448,60,0,8081540) at netbsd:namei+0x138
> sys___lstat13(cd02e2ac,cc0e4f64,cc0e4f5c,0,c153f000) at netbsd:sys___lstat13+0x58
> syscall_plain() at netbsd:syscall_plain+0x7e
> --- syscall (number 280) ---
> 0x480e7357:
I had two more lock-ups in checkalias() after upgrading the kernel
to the 2.0 release. Checking the checkalias() routine in
kern/vfs_subr.c in -current, I find the changes made by mycroft in
revision 1.231:
date: 2004/08/13 22:48:06; author: mycroft; state: Exp; lines: +59 -54
There is an annoying deadlock that goes like this:
* Process A is closing one file descriptor belonging to a device. In
  doing so, ffs_update() is called and starts writing a block
  synchronously. (Note: This leaves the vnode locked. It also has other
  instances -- stdin, et al -- of the same device open, so v_usecount is
  definitely non-zero.)
* Process B does a revoke() on the device. The revoke() has to wait for
  the vnode to be unlocked because ffs_update() is still in progress.
* Process C tries to open() the device. It wedges in checkalias()
  repeatedly calling vget() because it returns EBUSY immediately.
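
To make the last step of the commit message concrete, here is a minimal user-space sketch of a retry loop that spins on an immediate EBUSY. It is not the code from kern/vfs_subr.c. The names toy_vnode, toy_vget, toy_checkalias and the max_spins bound are hypothetical and exist only so the demonstration terminates; as described above, the kernel loop had no such bound.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; NOT the NetBSD struct vnode. */
struct toy_vnode {
    bool locked;       /* held by "process A" during its synchronous write */
    int  unlock_after; /* failed attempts before A finishes; < 0 means never */
    int  attempts;     /* number of toy_vget() calls made so far */
};

/* Toy vget(): returns EBUSY immediately while the vnode is locked. */
static int toy_vget(struct toy_vnode *vp)
{
    vp->attempts++;
    if (vp->locked && vp->unlock_after >= 0 &&
        vp->attempts > vp->unlock_after)
        vp->locked = false;        /* model A releasing the lock */
    return vp->locked ? EBUSY : 0;
}

/*
 * Retry loop in the shape the commit message describes: on EBUSY,
 * try again at once without sleeping. The max_spins bound is an
 * assumption added for the demonstration.
 * Returns the number of attempts on success, or -1 after giving up.
 */
static int toy_checkalias(struct toy_vnode *vp, int max_spins)
{
    assert(max_spins > 0);
    for (int i = 0; i < max_spins; i++) {
        if (toy_vget(vp) == 0)
            return vp->attempts;
    }
    return -1;
}
```

With unlock_after set to 3, toy_checkalias() succeeds on the fourth attempt. With unlock_after set to a negative value, which models a lock that is never released while the caller spins, every attempt fails and the loop only ends because of the artificial bound.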
It looks like the deadlock is triggered for me by a find (from
/etc/daily/weekly) combined with other accesses by imap/mail and/or
the web server.
Should rev 1.231 be pulled up to the 2.0 branch? And is the change in
1.231 alone sufficient, or are other pre- or corequisite patches
needed?
--
aew