[ Adding tech-kern. The relevant earlier mails start at
http://mail-index.netbsd.org/current-users/2015/10/19/msg028233.html
This is about a default-installed amd64 GENERIC 7.0 kernel.
Replies are better in tech-kern, I think, so I set Reply-To accordingly. ]

On Fri 23 Oct 2015 at 00:46:57 +0200, Rhialto wrote:
> This problem is very repeatable, usually within a few hours, just now it
> happened within half an hour.
>
> It seems to me that somehow the nfs_reqq list gets corrupted. Then
> either there is a crash when traversing it in nfs_timer() (occurring in
> nfs_sigintr() due to being called with a bogus pointer), or there is a
> hang when one of the NFS requests gets lost and never retried.

Looking into this: the occurrences of nfs_reqq are as follows:

fs/nfs/client/nfs_clvnops.c: *  nfs_reqq_mtx : Global lock, protects the nfs_reqq list.

Since there is no other mention of nfs_reqq_mtx in the whole syssrc
tarball, this looks wrong. It also immediately causes the suspicion that
the list isn't in fact protected at all.

nfs/nfs.h:extern TAILQ_HEAD(nfsreqhead, nfsreq) nfs_reqq;

nfs/nfs_clntsocket.c:   TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {
nfs/nfs_clntsocket.c:   TAILQ_INSERT_TAIL(&nfs_reqq, rep, r_chain);
nfs/nfs_clntsocket.c:   TAILQ_REMOVE(&nfs_reqq, rep, r_chain);

Protected with
    s = splsoftnet();
for match #2 and #3 but #1 seems not protected by anything I can see
nearby. Maybe it is
    error = nfs_rcvlock(nmp, myrep);
if that makes any sense. That function definitely does not use either
splsoftnet() OR mutex_enter(softnet_lock).

nfs/nfs_socket.c:struct nfsreqhead nfs_reqq;
nfs/nfs_socket.c:       TAILQ_FOREACH(rp, &nfs_reqq, r_chain) {
nfs/nfs_socket.c:       TAILQ_FOREACH(rep, &nfs_reqq, r_chain) {

match #3 is protected with
    mutex_enter(softnet_lock);      /* XXX PR 40491 */
but none of the others (visibly nearby). #2 is called from nfs_receive()
which uses nfs_sndlock() which also doesn't use either splsoftnet() OR
mutex_enter(softnet_lock).

nfs/nfs_subs.c: TAILQ_INIT(&nfs_reqq);

presumably doesn't need any extra protection.

softnet_lock is allocated as

./kern/uipc_socket.c:kmutex_t   *softnet_lock;
./kern/uipc_socket.c:   softnet_lock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);

IPL_NONE seems inconsistent with splsoftnet().

I never studied the inner details of kernel locking, but the diversity
of protections of this list doesn't inspire trust at first sight...

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'