Port-xen archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: timekeeping regression? (a possible clue about the ~7.5 days)



[[ Note there's a question buried down there in my ramblings about what
   I should print in ntp_adjtime(2) to see what ntpd is doing. ]]

So some possible clues, or at least one new possible clue, for the weird
event that happens after ~7.5 days, or somewhere between about 650,000
and 660,000 seconds, of uptime.

It turns out the Xen watchdog reboots I've observed recently may have
been caused by these timekeeping problems, which of course makes sense
since the watchdog is essentially a timer.  With my latest version of
xen_clock.c both my machines have been rebooted after ~7.5 days by
xenwatchdog.

I've run xenwatchdogd with the (standard?) parameters "30 15".

What I've observed on both my local machines is a sudden change in the
direction of change of the delta, between Xen system_time (aka "Xen wall
clock time") and NetBSD nanouptime().

I noticed this because in my new xen_clock.c I added a printf() in
xen_rtc_set() to answer my question about whether xen_rtc_set() could
avoid using xen_glocal_systime_ns() (i.e. the Xen kernel's system_time)
by simply using nanouptime() instead.

One thing this printf() revealed is that there is a difference between
the initial nanouptime() value and what Xen knows as system_time, and of
course this is an expected thing since the dom0 kernel starts after the
Xen kernel is already running.

What I didn't expect is a constant drift between these values, though
perhaps I should have given how the code works.

Both the Xen kernel and the dom0 kernel should be using the same formula
for converting TSC values to nanoseconds so they shouldn't drift with
respect to each other on that account.

The only other explanation I can see for this drift is that it is the
time between when the Xen kernel sets the vcpu_time_info->system_time
value and when the dom0 kernel fetches that value and then submits it
back as the new system_time with XENPF_settime.

The execution time between those milestones is always "lost" (i.e. it is
unaccounted for), while no loss occurs in the timer used by nanouptime()
since it's using the un-adjusted "raw" TSC value.  (RDTSC is _not_
emulated for dom0)

(To me this may raise again the argument that dom0 should use
nanouptime() for xen_rtc_set() since that's the value directly related
to the wallclock time provided in tvp to xen_rtc_set().)

The "clue" to something going wrong after the ~7.5 days (between 650k
and 660k seconds) of uptime is seen in the change of direction of the
delta of change in the difference between system_time and nanouptime.
The first two lines are right after boot.  The next four lines show the
difference constantly still getting wider, as it has since boot, and
then suddenly in the sixth line it gets a lot narrower again, and then
it gets wider again, then the watchdog fires:

[  50.1593838] xen_rtc_set: Setting to 1722109487.404887000 s at systime 60784095871 ns (nanouptime: 50159594026 ns, diff(st-nt): 10.624501845 s)
[ 531.0352790] xen_rtc_set: Setting to 1722109968.280782000 s at systime 541627571059 ns (nanouptime: 531035319089 ns, diff(st-nt): 10.592251970 s)

[ 653317.1451660] xen_rtc_set: Setting to 1721447126.779810000 s at systime 653286589324921 ns (nanouptime: 653317145166034 ns, diff(st-nt): -30.555841113 s)
[ 653847.5469777] xen_rtc_set: Setting to 1721447657.181621000 s at systime 653816957323891 ns (nanouptime: 653847546977685 ns, diff(st-nt): -30.589653794 s)
[ 654378.3798177] xen_rtc_set: Setting to 1721448188.014461000 s at systime 654347756324820 ns (nanouptime: 654378379817734 ns, diff(st-nt): -30.623492914 s)
[ 654908.7496261] xen_rtc_set: Setting to 1721448718.384270000 s at systime 654878092324619 ns (nanouptime: 654908749626128 ns, diff(st-nt): -30.657301509 s)

[ 655438.5014603] xen_rtc_set: Setting to 1721449248.136104000 s at systime 655408810326817 ns (nanouptime: 655438501460300 ns, diff(st-nt): -29.691133483 s)

[ 655968.6892555] xen_rtc_set: Setting to 1721449778.323899000 s at systime 655938964326540 ns (nanouptime: 655968689255463 ns, diff(st-nt): -29.724928923 s)
[ 656499.3370809] xen_rtc_set: Setting to 1721450308.971725000 s at systime 656469578326883 ns (nanouptime: 656499337080864 ns, diff(st-nt): -29.758753981 s)
[ 657029.4058446] xen_rtc_set: Setting to 1721450839.040488000 s at systime 656999613326338 ns (nanouptime: 657029405844633 ns, diff(st-nt): -29.792518295 s)
[ 657559.9095515] xen_rtc_set: Setting to 1721451369.544195000 s at systime 657530083325040 ns (nanouptime: 657559909551465 ns, diff(st-nt): -29.826226425 s)
(XEN) [2024-07-20 04:57:24.044] Watchdog timer fired for domain 0
(XEN) [2024-07-20 04:57:24.044] Hardware Dom0 shutdown: watchdog rebooting machine


I'm not sure if/how ntpd may be involved in this.  I'd like to add some
printfs to ntp_adjtime(2) to see what NTP is doing, if anything, at the
time this "event" happens.  However I'm not sure what to print!  Can
anyone give me any hints please?

Note that when there was no reboot (as I'm hoping for again on one
machine where I've disabled xenwatchdogd) ntpd continues to keep the
system time in sync with real wall clock time, despite all these strange
changes in system_time to nanouptime() underneath.

I might reduce the default machdep.xen.timepush_ticks as well to get a
finer resolution on these changes right from boot.


On the other machine the change of direction happened twice, but after
the second change it continued to shrink quite quickly until
mDNSResponder noticed problems, then the watchdog finally fired.  Note
the slightly odd difference in this machine where the diff is
increasing.  This is probably caused by ntpd or ntpdate doing some
adjustment right after boot, since the first line is before they run,
and the second is after the login prompt is printed.

[  91.5819791] xen_rtc_set: Setting to 1721569164.078504000 s at systime 98520983672 ns (nanouptime: 91582082756 ns, diff(st-nt): 6.938900916 s)
[ 234.6170722] xen_rtc_set: Setting to 1720904503.415741000 s at systime 242014616921 ns (nanouptime: 234617115151 ns, diff(st-nt): 7.397501770 s)


[ 652774.4369065] xen_rtc_set: Setting to 1721557043.235576000 s at systime 652740334618425 ns (nanouptime: 652774436906471 ns, diff(st-nt): -34.102288046 s)
[ 653304.8010014] xen_rtc_set: Setting to 1721557573.599671000 s at systime 653270665618579 ns (nanouptime: 653304801001414 ns, diff(st-nt): -34.135382835 s)
[ 653835.6482080] xen_rtc_set: Setting to 1721558104.446877000 s at systime 653801479617526 ns (nanouptime: 653835648208012 ns, diff(st-nt): -34.168590486 s)
[ 654365.9794871] xen_rtc_set: Setting to 1721558634.778156000 s at systime 654331777619245 ns (nanouptime: 654365979487087 ns, diff(st-nt): -34.201867842 s)
[ 654895.7309103] xen_rtc_set: Setting to 1721559164.529580000 s at systime 654862495618958 ns (nanouptime: 654895730910325 ns, diff(st-nt): -33.235291367 s)
[ 655424.8833876] xen_rtc_set: Setting to 1721559693.682057000 s at systime 655392614618138 ns (nanouptime: 655424883387637 ns, diff(st-nt): -32.268769499 s)
[ 655887.4146385] xen_rtc_set: Setting to 1721560156.763236000 s at systime 655855116749974 ns (nanouptime: 655887414671239 ns, diff(st-nt): -32.297921265 s)
[ 655955.4068767] xen_rtc_set: Setting to 1721560224.755475000 s at systime 655923104929134 ns (nanouptime: 655955407181922 ns, diff(st-nt): -32.302252788 s)
[ 656485.4746822] xen_rtc_set: Setting to 1721560754.823280000 s at systime 656453139617880 ns (nanouptime: 656485475644011 ns, diff(st-nt): -32.336026131 s)
[ 657015.9620644] xen_rtc_set: Setting to 1721561285.310662000 s at systime 656983592616393 ns (nanouptime: 657015962442522 ns, diff(st-nt): -32.369826129 s)
[ 657342.1570541] xen_rtc_set: Setting to 1721561612.952759000 s at systime 657309766444447 ns (nanouptime: 657342157054086 ns, diff(st-nt): -32.390609639 s)
[ 657546.8242688] xen_rtc_set: Setting to 1721561817.619974000 s at systime 657514420618788 ns (nanouptime: 657546824268810 ns, diff(st-nt): -32.403650022 s)

[ 658076.1210560] xen_rtc_set: Setting to 1721562346.916761000 s at systime 658044683618063 ns (nanouptime: 658076121055977 ns, diff(st-nt): -31.437437914 s)

[ 658477.1406057] xen_rtc_set: Setting to 1721562748.933742000 s at systime 658445677616847 ns (nanouptime: 658477140605745 ns, diff(st-nt): -31.462988898 s)
[ 658606.7858700] xen_rtc_set: Setting to 1721562878.579007000 s at systime 658575314955443 ns (nanouptime: 658606786204718 ns, diff(st-nt): -31.471249275 s)

[ 659136.0016502] xen_rtc_set: Setting to 1721563407.794787000 s at systime 659105496618254 ns (nanouptime: 659136001650238 ns, diff(st-nt): -30.505031984 s)

[ 659665.6734376] xen_rtc_set: Setting to 1721563937.466574000 s at systime 659636134618881 ns (nanouptime: 659665673437575 ns, diff(st-nt): -29.538818694 s)
[ 660193.7440160] xen_rtc_set: Setting to 1721564465.537153000 s at systime 660166171618865 ns (nanouptime: 660193744015952 ns, diff(st-nt): -27.572397087 s)
[ 660462.9185721] xen_rtc_set: Setting to 1721564736.230970000 s at systime 660435329114526 ns (nanouptime: 660462918572069 ns, diff(st-nt): -27.589457543 s)
[ 660724.2247233] xen_rtc_set: Setting to 1721564997.537122000 s at systime 660696618618698 ns (nanouptime: 660724224723324 ns, diff(st-nt): -27.606104626 s)
[ 661255.0885441] xen_rtc_set: Setting to 1721565528.400943000 s at systime 661227448619059 ns (nanouptime: 661255088544137 ns, diff(st-nt): -27.639925078 s)
[ 661784.4703323] xen_rtc_set: Setting to 1721566057.782731000 s at systime 661757796617498 ns (nanouptime: 661784470332318 ns, diff(st-nt): -26.673714820 s)
[ 661856.2163227] xen_rtc_set: Setting to 1721566132.007573000 s at systime 661829538037048 ns (nanouptime: 661856216322686 ns, diff(st-nt): -26.678285638 s)
[ 662314.3011522] xen_rtc_set: Setting to 1721566590.092403000 s at systime 662288593619020 ns (nanouptime: 662314301152190 ns, diff(st-nt): -25.707533170 s)
[ 662843.5199313] xen_rtc_set: Setting to 1721567119.311182000 s at systime 662818779260842 ns (nanouptime: 662843520573410 ns, diff(st-nt): -24.741312568 s)
[ 663374.1197352] xen_rtc_set: Setting to 1721567649.910986000 s at systime 663349345121253 ns (nanouptime: 663374120237444 ns, diff(st-nt): -24.775116191 s)
[ 663904.1185640] xen_rtc_set: Setting to 1721568179.909815000 s at systime 663880309618763 ns (nanouptime: 663904118563975 ns, diff(st-nt): -23.808945212 s)
[ 664033.0198270] xen_rtc_set: Setting to 1721568311.810332000 s at systime 664009202669720 ns (nanouptime: 664033019827018 ns, diff(st-nt): -23.817157298 s)
[ 664431.6253604] xen_rtc_set: Setting to 1721568710.415866000 s at systime 664410782660994 ns (nanouptime: 664431625403916 ns, diff(st-nt): -20.842742922 s)
Jul 21 06:33:17 xenful mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 915 ticks; setting correction factor to 1814271192
Jul 21 06:33:24 xenful mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 919 ticks; setting correction factor to 1814272111
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 416 ticks; setting correction factor to 1814272527
Jul 21 06:33:24 xenful mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 853 ticks; setting correction factor to 1814273380
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 33 ticks; setting correction factor to 1814273413
Jul 21 06:33:24 xenful mDNSResponder: mDNS_Execute,5348: mDNSPlatformRawTime went backwards by 689 ticks; setting correction factor to 1814274102
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 887 ticks; setting correction factor to 1814274989
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 431 ticks; setting correction factor to 1814275420
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 72 ticks; setting correction factor to 1814275492
Jul 21 06:33:24 xenful mDNSResponder: mDNSCoreReceive,10552: mDNSPlatformRawTime went backwards by 890 ticks; setting correction factor to 1814276382
(XEN) [2024-07-21 13:33:51.929] Watchdog timer fired for domain 0
(XEN) [2024-07-21 13:33:51.929] Hardware Dom0 shutdown: watchdog rebooting machine

--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>

Attachment: pgpRC_GZSD5tX.pgp
Description: OpenPGP Digital Signature



Home | Main Index | Thread Index | Old Index