At Mon, 10 Jun 2024 19:45:07 -0400, Brad Spencer <brad%anduin.eldar.org@localhost> wrote:
Subject: Re: timekeeping regression?
>
> "Greg A. Woods" <woods%planix.ca@localhost> writes:
>
> > It has recorded a drift value just 50.172, though oddly it is not
> > updating the file hourly like ntp.conf(5) suggests it should be doing.
> > It hasn't written the file since booting.  None of my VMs have drift
> > values over 66.
>
> I suspect it hasn't updated the file because the VM has lost sync with
> the server.  It appears that the drift file on mine is being updated.

Ah, yes, it had not quite sunk in when I was looking at those logs that
clock_sync was not being maintained for more than about a minute (if
even that long -- probably only for the instant when the clock_sync log
entry was generated), which I guess is more or less what could be
expected given the way I (mis)configured this system for testing.

Note that most of the rest of your analysis isn't really meaningful for
this particular system as it is a deliberate test to see what happens
when Xen is _not_ messing with the RDTSC instruction.

(Note though that on my LAN I only intend to run one NTP server, so LAN
clients normally only ever have this one local server configured, so all
of that part is "normal" and as-intended.)

Anyway, the result with tsc_mode=native is more or less matching what I
would expect to see on bare multi-core hardware with an older CPU (as
this is) if one forced a NetBSD kernel running full SMP to use TSC as
its timecounter source.

It does though also show that NTPd is remarkably persistent at trying to
keep the clock in line if things aren't too wonky, as opposed to the
main problem this thread is about where something suddenly goes far too
wonky for domUs under the more recent Xen versions after 7 to 8 days of
uptime, and where prior to that everything runs perfectly with no hint
of any problem whatsoever.
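
For scale, the drift values quoted above are frequency offsets in parts
per million as recorded in ntpd's drift file.  A minimal sketch, assuming
only that unit, of what such an offset would amount to per day if left
uncorrected, plus a helper to see how long ago a drift file was last
written.  The function names are made up for illustration, and the drift
file path is left as a parameter because it depends on the local
ntp.conf:

```python
import os
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def ppm_to_seconds_per_day(ppm: float) -> float:
    """Clock error accumulated per day by an uncorrected offset given in PPM."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def drift_file_age_hours(path: str, now: Optional[float] = None) -> float:
    """Hours since the file at `path` was last modified.

    `path` is whatever the local ntp.conf names as its driftfile
    (an assumption left to the caller).  Raises OSError if it is missing.
    """
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) / 3600.0


# 50.172 PPM is roughly 4.33 seconds per day; 66 PPM is roughly 5.70.
print(round(ppm_to_seconds_per_day(50.172), 2))
print(round(ppm_to_seconds_per_day(66), 2))
```

An age well beyond an hour from drift_file_age_hours() would be
consistent with ntpd not having held sync long enough to rewrite the
file, though it does not by itself prove that is the cause.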
So I think my conclusion at the moment is that there's something
happening with the RDTSC emulation, at least with tsc_mode=default,
whereby suddenly a value is returned that causes the NetBSD clock to
jump so wildly that NTPd immediately gives up.  Unfortunately I'm not
seeing anything obvious about what happened in the logs from ntpd, nor
its state after the fact, when this occurs.

Given FreeBSD's ability to withstand this event my current guess is that
there's something wrong with the TSC frequency scaling code in NetBSD,
but I'm at a total loss as to why it fails with only some versions of
Xen.

I'm still waiting to see if there's any difference with
tsc_mode=always_emulate.  That is being tested with a stock NetBSD-10
install and with NTPd using the pool servers.  Only thing I forgot to
adjust was ntpd's log levels so I won't see any clock_sync or
no_sys_peer messages.

-- 
Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675
RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>
Avoncote Farms <woods%avoncote.ca@localhost>