Re: timekeeping regression?

To: NetBSD/xen Discussion List <port-xen%NetBSD.org@localhost>
Subject: Re: timekeeping regression?
From: "Greg A. Woods" <woods%planix.ca@localhost>
Date: Thu, 06 Jun 2024 21:43:46 -0700

OK, so I noticed that the Xen domUs which were losing track of time
would do so only after about 7 to 8 days of uptime.

I noticed in the logs that the system's clock started wandering off not
long before ntpd reported "no_sys_peer":

    May 24 04:08:07 more ntpd[4132]: 0.0.0.0 0618 08 no_sys_peer

Note this system was booted May 16 @19:30UTC.
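
As a rough check of the "7 to 8 days" figure, here is a minimal sketch
that subtracts the boot time from the log timestamp.  It assumes the
syslog entry is from 2024 and is in UTC; syslog records neither the year
nor the zone, so both are guesses.

```python
from datetime import datetime, timezone

# Boot time as stated in the message (UTC).
boot = datetime(2024, 5, 16, 19, 30, tzinfo=timezone.utc)

# Timestamp of the "no_sys_peer" log entry.
# ASSUMPTION: the year (2024) and the zone (UTC) are not in the log line.
lost_sync = datetime(2024, 5, 24, 4, 8, 7, tzinfo=timezone.utc)

uptime = lost_sync - boot
print(uptime)                                     # 7 days, 8:38:07
print(round(uptime.total_seconds() / 86400, 2))   # 7.36
```

If the log is in a local zone a few hours behind UTC, the result grows
by those hours but still lands between 7 and 8 days.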

Ntpd is still running, but it never gets the system back in sync.  I'm
guessing the underlying system clock drifts suddenly and then never gets
close enough for ntpd to take control again.

Restarting ntpd, even after forcing the clock back in line with
"ntpdate", has never proved successful.  The clock wanders almost
immediately and ntpd never gets it back in sync and never gives a new
"sys_peer" log entry.

Rebooting the domU doesn't help -- the clock wanders almost immediately.

I happen to run mDNSResponder on some of these domUs, and here's one
complaining immediately after a reboot, even before ntpd gets started:

    Jun  6 13:36:26 nbtcur mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 331 ticks; setting correction factor to 2706829871
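
To illustrate what a "went backwards" complaint means, here is a minimal
sketch that scans successive raw clock readings for any that are lower
than the one before.  This is not mDNSResponder's code; the function name
and the sample readings are hypothetical, chosen so that the drop matches
the 331 ticks in the log line.

```python
def backward_steps(samples):
    """Return (index, ticks) for each reading lower than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            # The clock source stepped backwards by -delta ticks.
            steps.append((i, -delta))
    return steps


# Hypothetical raw tick readings (not taken from any real system).
readings = [1000, 1400, 1069, 1500]
print(backward_steps(readings))  # [(2, 331)]
```

A clock source that is supposed to be monotonic should always give an
empty list here.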

Only a hard reboot of the whole system (dom0 & Xen) fixes the problem
(temporarily -- for somewhere between 7 and 8 days).


Now all my domUs are running with the default "tsc_mode=0", so given
that the "xen_system_time" timecounter is using the "rdtsc" instruction,
perhaps there's something happening in the Xen hypervisor after 7 to 8
days of uptime that for some reason changes what it's doing with the
emulated "rdtsc", possibly switching from emulated to not emulated.

On the machine with no problems:

(XEN) [2024-06-06 23:51:13.947] TSC marked as reliable, warp = 0 (count=2)
(XEN) [2024-06-06 23:51:13.947] dom3: mode=0,ofs=0xe56bf274c,khz=2400085,inc=1
(XEN) [2024-06-06 23:51:13.947] dom8: mode=0,ofs=0xf87c323f18a2f,khz=2400085,inc=1
(XEN) [2024-06-06 23:51:13.947] dom16(hvm): mode=0,ofs=0x1ee3b7725b74d0,khz=2400085,inc=1
(XEN) [2024-06-06 23:51:13.947] dom18: mode=0,ofs=0x208d9546839caa,khz=2400085,inc=1


On the machines with problems:

(XEN) [2024-06-06 23:47:44.347] TSC has constant rate, deep Cstates possible, so not reliable, warp=4200 (count=1)
(XEN) [2024-06-06 23:47:44.347] dom1: mode=0,ofs=0xc359a2a1c,khz=2826252,inc=1
(XEN) [2024-06-06 23:47:44.347] dom3: mode=0,ofs=0x675b93e6041bd,khz=2826252,inc=1


(XEN) [2024-06-06 23:50:38.869] TSC has constant rate, deep Cstates possible, so not reliable, warp=2081 (count=2)
(XEN) [2024-06-06 23:50:38.869] dom3(hvm): mode=0,ofs=0x22444ce5c4c21,khz=3158786,inc=1
(XEN) [2024-06-06 23:50:38.869] dom4: mode=0,ofs=0x2ca29a9c899a9,khz=3158786,inc=1
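
For comparing these dumps across several hosts, here is a minimal sketch
that parses the console lines quoted above.  The function and field names
are my own.  The conversion of "warp" to microseconds assumes that warp
is counted in TSC cycles and that the per-domain "khz" value equals the
host TSC frequency; I have not confirmed either against the Xen source.

```python
import re

# Matches e.g. "dom3(hvm): mode=0,ofs=0x22444ce5c4c21,khz=3158786,inc=1"
DOM_RE = re.compile(
    r"dom(?P<id>\d+)(?P<hvm>\(hvm\))?: "
    r"mode=(?P<mode>\d+),ofs=(?P<ofs>0x[0-9a-f]+),"
    r"khz=(?P<khz>\d+),inc=(?P<inc>\d+)"
)
# Matches both "warp = 0 (count=2)" and "warp=4200 (count=1)"
WARP_RE = re.compile(r"warp\s*=\s*(?P<warp>\d+) \(count=(?P<count>\d+)\)")


def parse_tsc_dump(text):
    """Return (host, doms) parsed from one block of console output."""
    host, doms = None, []
    for line in text.splitlines():
        m = WARP_RE.search(line)
        if m:
            host = {
                # "not reliable" also contains "reliable", so match the
                # full phrase used on the good machine.
                "reliable": "marked as reliable" in line,
                "warp": int(m["warp"]),
                "count": int(m["count"]),
            }
            continue
        m = DOM_RE.search(line)
        if m:
            doms.append({
                "id": int(m["id"]),
                "hvm": m["hvm"] is not None,
                "mode": int(m["mode"]),
                "ofs": int(m["ofs"], 16),
                "khz": int(m["khz"]),
            })
    return host, doms


sample = """\
(XEN) [2024-06-06 23:47:44.347] TSC has constant rate, deep Cstates possible, so not reliable, warp=4200 (count=1)
(XEN) [2024-06-06 23:47:44.347] dom1: mode=0,ofs=0xc359a2a1c,khz=2826252,inc=1
(XEN) [2024-06-06 23:47:44.347] dom3: mode=0,ofs=0x675b93e6041bd,khz=2826252,inc=1
"""

host, doms = parse_tsc_dump(sample)
# ASSUMPTION: warp is in TSC cycles, so cycles * 1000 / kHz gives microseconds.
warp_us = host["warp"] * 1000 / doms[0]["khz"]
print(host["reliable"], len(doms), round(warp_us, 3))  # False 2 1.486
```

Under those assumptions the reported warp on the first problem machine
is on the order of a microsecond or two.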


The "hvm" domUs are running FreeBSD and have no problems.  They're using
what they call XENTIMER as their timecounter clock source.  The FreeBSD
code is very different, at least at first glance, and somewhat more
convoluted in some ways.  I don't see an obvious "rdtsc" instruction
being used, though there are hints that's what it is doing; I may be
wrong.

Anyway I'm going to try "tsc_mode=1" (always emulate) on the NetBSD
domUs next....

--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>
