So, I now know why we want to use "dom0_vcpus_pin=true" w.r.t. timekeeping!

I updated xen_clock.c to 1.18 and turned on XEN_CLOCK_DEBUG (and then commented out one of the super-noisy device_printf() calls that actually caused the system to hang), and I started seeing thousands of printfs like the following, but only on dom0, and only on the one machine where I didn't have dom0's CPUs pinned.

[ 83329.4245423] xen raw systime + tsc delta went backwards: 82591317579681 > 82591299251748
[ 83329.4245423]  raw_systime_ns=82590641756625
[ 83329.4245423]  tsc_timestamp=233578790859082
[ 83329.4245423]  tsc=233580649104491
[ 83329.4245423]  tsc_to_system_mul=3039340271
[ 83329.4245423]  tsc_shift=-1
[ 83329.4245423]  delta_tsc=1858245409
[ 83329.4245423]  delta_ns=657495123

Make that hundreds of thousands in less than a day:

# uptime
 4:33PM  up 23:26, 2 users, load averages: 0.08, 0.02, 0.01

vcpu0 raw systime went backwards       395276    4 intr
vcpu0 missed hardclock                 423534    5 intr
vcpu0 timecounter went backwards       242583    2 intr
vcpu1 raw systime went backwards       261025    3 intr
vcpu1 missed hardclock                 462819    5 intr
vcpu1 timecounter went backwards       256918    3 intr

Also time drifted.....

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 xentastic.local 192.75.191.16    3 u  105  256  377    0.469  -724851 6055.97

So I pinned them at runtime:

# xl vcpu-list Domain-0
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    3   r--     806.0  all / all
Domain-0                             0     1    2   -b-     715.6  all / all
# xl vcpu-pin 0 0 0
# xl vcpu-pin 0 1 1
# xl vcpu-list Domain-0
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    0   -b-     807.9  0 / all
Domain-0                             0     1    1   r--     716.6  1 / all

And voila! Instantly no more "raw systime went backwards" events!

Also ntpd is once again able to hold the clock stable (after a reset step by ntpdate).
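As an aside, the numbers in the debug printf above can be cross-checked by hand. The following is only a sketch, assuming the TSC scaling formula documented for vcpu_time_info in Xen's public headers (shift the TSC delta by tsc_shift, multiply by tsc_to_system_mul, keep the upper bits with a right shift of 32); it is not taken from xen_clock.c, and the function and variable names are made up for illustration. It also assumes the left-hand value in the "went backwards" message is the previously recorded time.

```python
# Sketch: reproduce delta_ns from the logged pvclock fields.
# Assumption: the scaling is ((delta << tsc_shift) * tsc_to_system_mul) >> 32,
# with a negative tsc_shift meaning a right shift. Names are illustrative.

def scale_tsc_delta(delta_tsc: int, tsc_to_system_mul: int, tsc_shift: int) -> int:
    """Convert a TSC tick delta to nanoseconds using 32.32 fixed-point scaling."""
    if tsc_shift < 0:
        delta_tsc >>= -tsc_shift
    else:
        delta_tsc <<= tsc_shift
    return (delta_tsc * tsc_to_system_mul) >> 32


# Values copied from the debug output.
raw_systime_ns = 82590641756625
tsc_timestamp = 233578790859082
tsc = 233580649104491
tsc_to_system_mul = 3039340271
tsc_shift = -1

# Assumed to be the previously recorded time (left side of the '>' in the log).
previous_ns = 82591317579681

delta_tsc = tsc - tsc_timestamp
delta_ns = scale_tsc_delta(delta_tsc, tsc_to_system_mul, tsc_shift)
now_ns = raw_systime_ns + delta_ns
backwards_by_ns = previous_ns - now_ns

print("delta_tsc       =", delta_tsc)
print("delta_ns        =", delta_ns)
print("now_ns          =", now_ns)
print("backwards_by_ns =", backwards_by_ns)
```

With the logged values this reproduces the delta_tsc and delta_ns from the printf, and raw_systime_ns + delta_ns comes out equal to the right-hand value in the "went backwards" line. Under the assumption above, that one event stepped back by roughly 18.3 ms.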
I thought the backwards steps might be because there's no way (that I know of) to set the tsc_mode for dom0, but given that the tsc_to_system_mul shown in the debug printf is about what it should be to round down to 1GHz on this machine, it seems RDTSC must be being emulated. I guess the RDTSC emulation must not be stable across CPUs? Or?

Now I wait some days again to see if the newest xen_clock.c gives me any more clues as to why (if it still happens) domU clocks begin to drift after ~7.5 days of uptime.....

--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>