So, I had a weird thing happen on one of my regularly used Xen-hosted virtual servers this morning...

The host is a Dell PE2950, running Xen-4.5 with 8.99.32 amd64. The domU is also 8.99.32 amd64. (I'm in the slow process of upgrading packages so I can upgrade Xen, but that's not been completed quite yet.)

From everything I could observe in shell and xterm sessions, the domU just seemed very sluggish and sometimes non-responsive. "systat vm" reported a very high "interrupts" count, and a rather high "sys" CPU use.

The console seemed completely dead, but had reported a stream of messages like:

    [Thu May 9 09:24:08 2019][ 6442662.0806318] route_enqueue: queue full, dropped message

There were thousands of identical lines, all separated by a few microseconds. No doubt this spew was the real cause of the apparent interrupt storm and the resulting sluggishness.

The other domUs and the dom0 seemed A-OK.

So I decided to reboot it from the dom0 and it did the right thing:

    [Thu May 9 10:09:46 2019][ 6445400.3265991] xenbus_shutdown_handler: xenbus_rm 13
    [Thu May 9 10:09:46 2019]May 9 10:09:46 future shutdown: poweroff by root: power button pressed
    [Thu May 9 10:10:05 2019]May 9 10:10:05 future syslogd[155]: Exiting on signal 15
    [Thu May 9 10:10:40 2019][ 6445454.6233182] syncing disks... 2 done
    [Thu May 9 10:10:40 2019][ 6445454.8073215] unmounting 0xffffbe00102cb008 /more/archive (more.local:/archive)...
    [Thu May 9 10:10:40 2019][ 6445454.9233295] ok
    [Thu May 9 10:10:40 2019][ 6445454.9233295] unmounting 0xffffbe00102c6008 /more/home (more.local:/home)...

But "Because NFS" it stuck there trying to unmount /home and I ended up typing the unfortunate command:

    xl destroy future

I've never had to be quite so emphatic before! :-)

However, rebooting got "future" running quite happily again!

As mentioned it's been taking a while to upgrade, and the whole Xen server and all its production domains have been running for 87 days.
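For anyone wanting to measure how many lines such a burst contained and how long it lasted, here is a minimal Python sketch. It assumes the console log has been saved to a file and uses the same two-bracket timestamp format as the lines quoted above; the function name `summarize_bursts` and the 5-second gap threshold are illustrative choices, not part of any existing tool.

```python
import re

# Matches console lines such as
# "[Thu May 9 09:24:08 2019][ 6442662.0806318] route_enqueue: queue full, dropped message"
# and captures the kernel timestamp (seconds since boot) and the message text.
LINE_RE = re.compile(r"^\[[^\]]*\]\[\s*(\d+\.\d+)\]\s*(.*)$")


def summarize_bursts(lines, needle="route_enqueue: queue full", gap=5.0):
    """Group matching lines into bursts separated by more than `gap` seconds.

    Returns a list of (first_timestamp, last_timestamp, line_count) tuples.
    Lines without a kernel timestamp (e.g. syslogd output) are skipped.
    """
    bursts = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or needle not in m.group(2):
            continue
        ts = float(m.group(1))
        if bursts and ts - bursts[-1][1] <= gap:
            # Close enough to the previous match: extend the current burst.
            first, _, count = bursts[-1]
            bursts[-1] = (first, ts, count + 1)
        else:
            # Too far from the previous match (or first match): new burst.
            bursts.append((ts, ts, 1))
    return bursts
```

Passing an open file object for the saved console log as `lines` yields one tuple per burst, so the elapsed time of each is simply the second element minus the first.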
When I looked back through the console log, though, I was surprised to find another blast of these messages from two months ago (after nearly a month of uptime). That spew stopped without me knowingly intervening, after nearly 7000 lines (but just 20 seconds elapsed), though curiously there's another odd message within seconds of the spew stopping:

    [Wed Mar 20 16:19:01 2019][ 2147554.5719048] route_enqueue: queue full, dropped message
    [Wed Mar 20 16:19:09 2019][ 2147562.9851727] pid 28947 (emacs): user write of 1019904@0x3640000 at 48052784 failed: 28

If that last message is from a core dump, it might have been caused by the route_enqueue problem (because it lost its X11 connection and emacs likes to dump core when that happens), or it might have caused the problem since it would have been dumping to an NFS server (because emacs on rare occasions ups and dumps core when you least expect it to, though thankfully far less so in recent releases).

Today though I don't think there was a core dump -- I was using two different emacs sessions on that host while experiencing the sluggish behaviour right up until it got too sluggish to use.

There are no other interesting messages in my console logs.

Does anyone have any clues/suggestions/questions for me?

-- 
Greg A. Woods <gwoods%acm.org@localhost>

+1 250 762-7675
RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>
Avoncote Farms <woods%avoncote.ca@localhost>