OK, so I noticed that the Xen domUs which were losing track of time would do so only after about 7 to 8 days of uptime.

I noticed in logs that the system's clock started wandering off not long before ntpd reports "no_sys_peer":

    May 24 04:08:07 more ntpd[4132]: 0.0.0.0 0618 08 no_sys_peer

Note this system was booted May 16 @19:30UTC.

Ntpd is still running, but never getting the system back in sync. I'm guessing the underlying system clock drifts suddenly and it never gets close enough to allow it to take control again.

Restarting ntpd, even after forcing the clock back in line with "ntpdate", has never proved successful. The clock wanders almost immediately and ntpd never gets it back in sync and never gives a new "sys_peer" log entry.

Rebooting the domU doesn't help -- the clock wanders almost immediately. I happen to run mDNSResponder on some of these domUs, and here's one complaining immediately after a reboot, even before ntpd gets started:

    Jun 6 13:36:26 nbtcur mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 331 ticks; setting correction factor to 2706829871

Only a hard reboot of the whole system (dom0 & Xen) fixes the problem (temporarily -- for somewhere between 7 and 8 days).

Now all my domUs are running with the default "tsc_mode=0", so given that the "xen_system_time" timecounter is using the "rdtsc" instruction, perhaps there's something happening in the Xen hypervisor after 7 to 8 days of uptime that for some reason changes what it's doing with the emulated "rdtsc", possibly switching from emulated to not emulated.
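For reference, the "7 to 8 days" figure can be checked against the two timestamps quoted above. This is only a rough sketch: it assumes the syslog timestamp is in UTC (the boot time is stated as UTC, the log entry is not) and that the year is 2024, as in the Xen console output.

```python
from datetime import datetime, timezone

# Boot time as stated in the message: May 16 @19:30 UTC.
boot = datetime(2024, 5, 16, 19, 30, 0, tzinfo=timezone.utc)

# Timestamp of the ntpd "no_sys_peer" log entry.
# ASSUMPTION: the syslog timestamp is UTC and the year is 2024.
lost_peer = datetime(2024, 5, 24, 4, 8, 7, tzinfo=timezone.utc)

# Uptime of the domU at the moment ntpd gave up on its peer.
uptime = lost_peer - boot
uptime_days = uptime.total_seconds() / 86400

print(f"uptime at no_sys_peer: {uptime} ({uptime_days:.2f} days)")
```

Under those assumptions this gives 7 days, 8:38:07. If the log is actually in a local timezone a few hours west of UTC, the uptime is longer by that offset.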
On the machine with no problems:

    (XEN) [2024-06-06 23:51:13.947] TSC marked as reliable, warp = 0 (count=2)
    (XEN) [2024-06-06 23:51:13.947] dom3: mode=0,ofs=0xe56bf274c,khz=2400085,inc=1
    (XEN) [2024-06-06 23:51:13.947] dom8: mode=0,ofs=0xf87c323f18a2f,khz=2400085,inc=1
    (XEN) [2024-06-06 23:51:13.947] dom16(hvm): mode=0,ofs=0x1ee3b7725b74d0,khz=2400085,inc=1
    (XEN) [2024-06-06 23:51:13.947] dom18: mode=0,ofs=0x208d9546839caa,khz=2400085,inc=1

On the machines with problems:

    (XEN) [2024-06-06 23:47:44.347] TSC has constant rate, deep Cstates possible, so not reliable, warp=4200 (count=1)
    (XEN) [2024-06-06 23:47:44.347] dom1: mode=0,ofs=0xc359a2a1c,khz=2826252,inc=1
    (XEN) [2024-06-06 23:47:44.347] dom3: mode=0,ofs=0x675b93e6041bd,khz=2826252,inc=1

    (XEN) [2024-06-06 23:50:38.869] TSC has constant rate, deep Cstates possible, so not reliable, warp=2081 (count=2)
    (XEN) [2024-06-06 23:50:38.869] dom3(hvm): mode=0,ofs=0x22444ce5c4c21,khz=3158786,inc=1
    (XEN) [2024-06-06 23:50:38.869] dom4: mode=0,ofs=0x2ca29a9c899a9,khz=3158786,inc=1

The "hvm" domUs are running FreeBSD and have no problems. They're using what they call XENTIMER as their timecounter clock source. The FreeBSD code is very different, at least on first glance, and somewhat more convoluted in some ways. I don't see an obvious "rdtsc" instruction being used, but there are hints that's what it is doing, but I may be wrong.

Anyway I'm going to try "tsc_mode=1" (always emulate) on the NetBSD domUs next....

-- 
Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC +1 250 762-7675
RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>
Avoncote Farms <woods%avoncote.ca@localhost>