[...]
So something seems to be able to lock out the clock from interrupting
for rather long times. Clock interrupts happen at IPL_SCHED, so
this would seem to mean that the system is running at IPL_HIGH for
extended periods (I don't think anything is using the IPL levels in
between).
Anyone have any ideas what this might be?

I don't, but I have an idea that might help you figure out what it is.
On hardclock entry, walk back up the stack far enough to find the PC
saved by the hardclock interrupt. This should be immediately after the
splx() that lowered the IPL and let the pending clock interrupt in; if
you're lucky, that location will be sufficiently non-generic to help.
Yes, walking the stack like this is MD, but for debugging an MD problem
I would not consider that a problem.
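
To make the bookkeeping half of that concrete, here is a rough sketch in
portable C. Everything in it is made up for illustration: note_tick(),
late_lookup(), late_tab, the 100 Hz tick and the "two periods" threshold
are all assumptions, not anything from an existing kernel. It also does
not attempt the MD part; it assumes the caller has already dug the saved
PC out of the interrupt frame and has a timestamp from some free-running
counter.

```c
#include <stddef.h>
#include <stdint.h>

#define TICK_US     10000u  /* assumed nominal tick period (100 Hz) */
#define LATE_FACTOR 2u      /* a gap of >= 2 periods counts as late */
#define NSLOTS      16u     /* number of distinct PCs remembered */

struct late_pc {
	uintptr_t pc;        /* PC the late clock interrupt returns to */
	unsigned  hits;      /* number of late ticks seen at this PC */
	uint64_t  worst_us;  /* longest gap observed at this PC */
};

static struct late_pc late_tab[NSLOTS];
static uint64_t last_tick_us;
static int have_last;

/*
 * Call once per clock tick with the current time and the interrupted PC.
 * Returns 1 if the tick was late and was recorded, 0 otherwise
 * (on time, first tick, or table full).
 */
int
note_tick(uint64_t now_us, uintptr_t pc)
{
	uint64_t gap;
	size_t i;

	if (!have_last) {
		/* First tick: nothing to compare against yet. */
		have_last = 1;
		last_tick_us = now_us;
		return 0;
	}
	gap = now_us - last_tick_us;
	last_tick_us = now_us;
	if (gap < (uint64_t)TICK_US * LATE_FACTOR)
		return 0;

	/* Slots fill in order, so the first empty slot ends the search. */
	for (i = 0; i < NSLOTS; i++) {
		if (late_tab[i].hits == 0 || late_tab[i].pc == pc) {
			late_tab[i].pc = pc;
			late_tab[i].hits++;
			if (gap > late_tab[i].worst_us)
				late_tab[i].worst_us = gap;
			return 1;
		}
	}
	return 0;	/* table full: drop this sample */
}

/* Return the entry recorded for pc, or NULL if none. */
const struct late_pc *
late_lookup(uintptr_t pc)
{
	size_t i;

	for (i = 0; i < NSLOTS && late_tab[i].hits != 0; i++)
		if (late_tab[i].pc == pc)
			return &late_tab[i];
	return NULL;
}
```

The table is fixed-size and nothing allocates or prints, on the theory
that you do not want to do either from the clock interrupt; dump late_tab
from a debugger or some less delicate context afterwards. A PC that shows
up with many hits and a large worst_us is the one to look up in the
symbol table.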