nfs server lockup (again)

To: tech-kern%netbsd.org@localhost, tech-net%netbsd.org@localhost
Subject: nfs server lockup (again)
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Fri, 17 Jun 2011 11:45:11 +0200
Hello,
some time ago I reported that an NFS server that was rock solid running a
5.0_STABLE kernel from February 2010 locks up when upgraded to 5.1_STABLE.
I couldn't reproduce this on a testbed, but I got the opportunity to try
the faulty kernel again on a production server and I can now report how
it deadlocks.

The root cause is an NFS process closing a socket:
db{0}> tr/a 0x00000000d01585c0
trace: pid 232 lid 19 at 0xd01dc9ac
sleepq_block(1,0,c063c259,c06a513c,c2027108,1,c0723d00,0,1,2df948) at 
netbsd:sleepq_block+0xeb
kpause(c063c259,0,1,0,c0715eca,c0387e83,c569f858,1000,d4263000,c569f800) at 
netbsd:kpause+0xe3
uvm_unloanpage(c569f858,1,2,c0123b65,c569f800,0,1000,0,c58902d8,4dfabc08) at 
netbsd:uvm_unloanpage+0x69
sodopendfreel(c0715ec8,0,d01dcabc,c03880a2,c58902d8,6,0,0,0,0) at 
netbsd:sodopendfreel+0xe1
sodopendfree(c58902d8,6,0,0,0,0,d01dcadc,cd5a4f80,0,c58902d8) at 
netbsd:sodopendfree+0x2a
sodisconnect(c58902d8,d01585c0,7,c0723e60,d6fdd700,4dfabc08,c0723e60,c032c4c2,d6fdd700,d6fdd700)
 at netbsd:sodisconnect+0x62
soclose(c58902d8,0,d01dcb4c,c031a3cd,d6fdd700,c0724060,d01dcb2c,0,c0272501,d01585c0)
 at netbsd:soclose+0x1b9
closef(d6fdd700,c02737e4,0,d4261d40,d4261d40,d01d93cc,d01dcbec,c027313d,d4261d40,d01d93cc)
 at netbsd:closef+0x5d
nfsrv_slpderef(d4261d40,d01d93cc,d01dcbd4,d01dcbda,c07277a0,c0724060,0,c06a7b18,c0724060,0)
 at netbsd:nfsrv_slpderef+0x9e
nfssvc_nfsd(d01dcc38,b8fffc28,d01585c0,0,0,0,c0724002,c0368cd4,0,d01585c0) at 
netbsd:nfssvc_nfsd+0x1cd
sys_nfssvc(d01585c0,d01dcd00,d01dcd28,0,d0153c80,0,0,4,b8fffc28,7c) at 
netbsd:sys_nfssvc+0x332
syscall(d01dcd48,b3,ab,1f,1f,0,b8e00000,b8fffcac,b8fffc28,4) at netbsd:syscall+0

The page passed to uvm_unloanpage() has a uobject, and for some reason
the object's lock was not available when uvm_unloanpage() tried to take it,
so it went to kpause. Note that the lock is now free:
db{0}> x/x c569f858
0xc569f858:     c2027108
db{0}> show page c2027108
PAGE 0xc2027108:
  flags=c<TABLED,CLEAN>, pqflags=200<PRIVATE2>, wire_count=0, pa=0x55ef8000
    uobject=0xd38d80b8, uanon=0x0, offset=0x0 loan_count=2
    [page ownership tracking disabled]

(gdb) print &((struct uvm_object*)0xd38d80b8)->vmobjlock
$3 = (kmutex_t *) 0xd38d80b8
db{0}> sh lock 0xd38d80b8
lock address : 0x00000000d38d80b8 type     :     sleep/adaptive
initialized  : 0x00000000c03a0746
shared holds :                  0 exclusive:                  0
shares wanted:                  0 exclusive:                  0
current cpu  :                  0 last held:                  2
current lwp  : 0x00000000cd5a7c80 last held: 000000000000000000
last locked  : 0x00000000c03a121e unlocked : 0x00000000c03a15d9
owner field  : 000000000000000000 wait/spin:                0/0
Turnstile chain at 0xc0723ce0.
=> No active turnstile for this lock.
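
For readers unfamiliar with the pattern, here is a minimal user-space
sketch of a try-lock-and-back-off loop of the kind described above. It is
not the uvm_loan.c code: try_lock_backoff() is a hypothetical helper, a
pthread mutex stands in for the object lock, and nanosleep() stands in for
kpause().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/*
 * User-space analogue (NOT kernel code) of a try-lock-and-back-off loop.
 * try_lock_backoff() is a hypothetical helper: it attempts to take
 * *lock up to max_tries times, sleeping about 1 ms after each failure.
 * Returns the number of the successful attempt (1-based) with the lock
 * held, or -1 if every attempt failed.
 */
int
try_lock_backoff(pthread_mutex_t *lock, int max_tries)
{
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };

	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (pthread_mutex_trylock(lock) == 0)
			return attempt;
		/*
		 * The lock is busy. Sleep instead of blocking on it; the
		 * wake-up depends on a timer, which is the weak point if
		 * the caller holds a lock that timer processing needs.
		 */
		nanosleep(&pause, NULL);
	}
	return -1;
}
```

The point of the sketch is only that the back-off sleep needs a timer to
end it, so whatever the caller still holds stays held for at least that
long.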

It looks like this is a vnode interlock:
0xc03a121e is in vrelel (/home/src/src-5/src/sys/kern/vfs_subr.c:1497).
1492                     *
1493                     * Note that VOP_INACTIVE() will drop the vnode lock.
1494                     */
1495                    VOP_INACTIVE(vp, &recycle);
1496                    mutex_enter(&vp->v_interlock);
1497                    vp->v_iflag &= ~VI_INACTNOW;
1498                    cv_broadcast(&vp->v_cv);
1499                    if (!recycle) {
1500                            if (vtryrele(vp)) {
1501                                    mutex_exit(&vp->v_interlock);
(gdb) l *(0x00000000c03a15d9)
0xc03a15d9 is in vrelel (/home/src/src-5/src/sys/kern/vfs_subr.c:1567).
1562                    } else {
1563                            vp->v_freelisthd = &vnode_free_list;
1564                    }
1565                    TAILQ_INSERT_TAIL(vp->v_freelisthd, vp, v_freelist);
1566                    mutex_exit(&vnode_free_list_lock);
1567                    mutex_exit(&vp->v_interlock);
1568            }
1569    }

(Side question: is it OK to put a vnode on a free list while a loan is
still active?)

soclose() got the socket lock, which happens to be softnet_lock
(from a quick look this seems right; I couldn't find where in nfs or tcp
the socket's lock would be changed to something other than softnet_lock).

So we are kpause()ing with softnet_lock held, and this is bad because
the CPU's softint thread can run, call one of the nfs or tcp timer routines
and sleep on softnet_lock:
              19 3   0         4           d01585c0              slave livelock
              46 3   6       204           cd5bd860          softclk/6 tstile
               5 3   0       204           cd5a7500          softclk/0 tstile

db{0}> tr/a cd5bd860
trace: pid 0 lid 46 at 0xcece0bc0
sleepq_block(0,0,c0647bf4,c06a4a9c,cd5a4f80,c0724000,ced28a78,dc,40,1000001) at 
netbsd:sleepq_block+0xeb
turnstile_block(ced28a60,1,cd5a4f80,c06a4a9c,cec6724c,cece4000,cece0c80,ced28a60,0,0)
 at netbsd:turnstile_block+0x261
mutex_vector_enter(cd5a4f80,cece4000,cece0ce0,0,c0349cc2,cd5bd860,7,cece4000,cece4000,10c)
 at netbsd:mutex_vector_enter+0x317
tcp_timer_rexmt(c573e408,cd5bd860,7,cece4000,cecd6074,cece4060,cece4868,cece5068,cece5868,c0135fb0)
 at netbsd:tcp_timer_rexmt+0x1f
callout_softclock(0,10,30,10,10,0,156320,d01cca6c,30,cece0010) at 
netbsd:callout_softclock+0x25f
softint_dispatch(cd5bb0c0,2,0,0,0,0,cece0d90,cece0b68,cece0bc0,0) at 
netbsd:softint_dispatch+0x9f

And now that our softclk threads are asleep waiting for the lock held by
the nfsd in kpause, the kpause timeout will never fire, the nfsd will never
be woken up, and the kernel is locked.
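
The dependency cycle can be written down as a toy model. This is a
deliberately simplified sketch, not kernel code: it assumes a single
softclk thread that runs its queue strictly in order, and every name in it
(callout_kind, sleeper_wakes) is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model (NOT kernel code); every name here is illustrative. */
enum callout_kind {
	CO_TCP_TIMER,		/* handler must take the shared lock */
	CO_SLEEP_WAKEUP		/* handler wakes the timed sleeper */
};

/*
 * Process one softclk queue in order. The sleeper holds the shared lock
 * if sleeper_holds_lock is true and releases it only after being woken.
 * Returns true if the sleeper is woken, false if softclk blocks first
 * or the queue holds no wake-up.
 */
bool
sleeper_wakes(const enum callout_kind *queue, size_t n, bool sleeper_holds_lock)
{
	for (size_t i = 0; i < n; i++) {
		/* softclk sleeps on the lock and never reaches later entries */
		if (queue[i] == CO_TCP_TIMER && sleeper_holds_lock)
			return false;
		if (queue[i] == CO_SLEEP_WAKEUP)
			return true;
	}
	return false;
}
```

With a TCP timer queued ahead of the wake-up and the lock held, the
function returns false; with the order reversed, or with the lock not
held, it returns true. In this model the hang depends on timing rather
than occurring on every close.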

Now the question is why this doesn't happen with the February 2010
5.0_STABLE kernel; nfsd_use_loan is also set to 1 in that kernel.
One change that may be relevant, because it defers more work to a callout
that takes softnet_lock, is tcp_input.c 1.291.4.3.

I guess this can happen with any TCP application using page loans, not only
nfsd.
Any idea how to fix this properly?
A workaround could be to use yield() in uvm_loan.c, because it would not
require a clock tick to wake up.
I'm not sure whether it's possible to drop the socket lock before unloaning
the pages.
But I wonder if this could be a more general issue with callouts.
Maybe we should have one thread per callout, a la fast_softint,
to be used when a callout needs to sleep?
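
To make the "drop the lock before unloaning" idea concrete, here is a
user-space sketch of the general pattern: detach the pending work while
holding the lock, then do the part that may sleep with the lock released.
All names are hypothetical (big_lock stands in for softnet_lock, release()
for the unloan step), and it says nothing about whether this is safe in
the socket code, where the lock is taken further up the stack in soclose().

```c
#include <pthread.h>
#include <stddef.h>

/* User-space illustration only; all names are hypothetical. */
struct pending {
	struct pending *next;
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pending *pending_head;	/* protected by big_lock */

void
pending_add(struct pending *p)
{
	pthread_mutex_lock(&big_lock);
	p->next = pending_head;
	pending_head = p;
	pthread_mutex_unlock(&big_lock);
}

/*
 * Detach the whole list while holding big_lock, then call release()
 * on each item with the lock dropped, so release() may sleep without
 * stalling other users of big_lock. Returns the number of items.
 */
int
pending_drain(void (*release)(struct pending *))
{
	pthread_mutex_lock(&big_lock);
	struct pending *list = pending_head;
	pending_head = NULL;
	pthread_mutex_unlock(&big_lock);

	int n = 0;
	while (list != NULL) {
		struct pending *next = list->next;
		release(list);
		list = next;
		n++;
	}
	return n;
}

/* Demo callback: counts the calls during which big_lock was free. */
int released_unlocked;

void
release_probe(struct pending *p)
{
	(void)p;
	if (pthread_mutex_trylock(&big_lock) == 0) {
		released_unlocked++;
		pthread_mutex_unlock(&big_lock);
	}
}
```

Calling pending_drain(release_probe) after two pending_add() calls should
process both items, with release_probe() finding big_lock free each time.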

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 ans d'experience feront toujours la difference