But whatever the actual bug here is, when NFS is involved we need proper packet traces to analyze it. In kernel backtraces or LOCKDEBUG output you can only see threads waiting in tstile on a vnode lock; you cannot see why the I/O never completes and the lock is never released.
Getting a pcap wouldn't be easy, but it's possible; the lockup sometimes takes a day to show up, sometimes a week.
What's strange is that shortly after the issue appeared, I ran a kernel with LOCKDEBUG and had no problems for more than a month. Since LOCKDEBUG adds a lot of overhead, I removed it, and got a lockup via NFS within four days.
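For anyone wanting to reproduce this, LOCKDEBUG is just a kernel config option (this is a config fragment, not a complete kernel config; the overhead comes from tracking every lock acquisition and release):

```
# Add to the kernel config file, then rebuild and boot the new kernel:
options 	LOCKDEBUG	# record lock operations for debugging
```

With it enabled, ddb can report held locks when the machine wedges, which is how I got the tstile-on-vnode-lock information in the first place.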
The same NFS server (amd64, NetBSD 8) serves an Amiga and an Alpha without problems, aside from the occasional "nfs_reply: ignoring error 55".
The previous NFS server was a Raspberry Pi, also running NetBSD 8, which served m68k and PowerPC Macs, VAXen, and various ARM SBCs, all running NetBSD 8 or -current.
The previous UltraSPARC machine was a Sun Fire V100 with 100 Mbps tlp* interfaces; it had no issues over the course of many months.
This machine has bge* interfaces, which could be buggy, but so do the Alpha and amd64 systems, and they've had no problems.
So it looks like it's an issue with running this system with multiple processors. But how does one diagnose this further when enabling LOCKDEBUG makes the problem go away?
I'm going to run it with tcpdump capturing on the NFS interfaces at both ends and see if it happens again. On the other hand, I can't imagine this being an NFS issue that shows up nowhere except on multiprocessor UltraSPARC. I'm open to the possibility, though :)
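Roughly, the capture I have in mind on each end looks like this (interface name and output path are just examples for my setup; filtering on port 2049 keeps the pcap to NFS traffic only):

```shell
# Capture full NFS packets (-s 0) on the bge0 interface and write a
# pcap file for later analysis with tcpdump -r or wireshark.
tcpdump -i bge0 -s 0 -w /var/tmp/nfs.pcap port 2049
```

The main worry is pcap size over a week of uptime, but restricting to port 2049 should keep it manageable.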
Thanks, John