At Tue, 25 Jun 2024 03:14:09 +0000, "Mathew, Cherry G." <c%bow.st@localhost> wrote:
Subject: Re: Xen timecounter issues
>
> >>>>> On Mon, 24 Jun 2024 16:02:04 -0700, "Greg A. Woods" <woods%planix.ca@localhost> said:
>
> > I did observe dom0 having problems without dom0_vcpus_pin=true on my
> > older hardware, but that was only a brief test.  As soon as I pinned the
> > vCPUs, even at runtime, the problems disappeared.
>
> So this "solution" worked only for dom0 ?

The dom0 with un-pinned vCPUs had entirely different symptoms than the
problematic domUs, so "worked only for dom0" isn't quite the correct
characterization.

On dom0 I was seeing "xen raw systime + tsc delta went backwards"
(i.e. ci_xen_raw_systime_backwards_evcnt).  I didn't let it run long
enough in that state for ntpd to get into trouble -- and indeed that
system has kept perfect time again since its vCPUs were (re)pinned.  I
more or less expected this to happen in dom0 on my current hardware
whenever raw/native RDTSC is used.

I did test a domU with tsc_mode=native (and unpinned) for a longer
period and saw similar bad behaviour, with its ntpd holding on for some
time but doing some clock stepping and eventually giving up.

> Did you try pinning them for domU that were drifting ?

I had not tried that.  I didn't see how vCPU pinning for a domU would
make any difference because:

 1. the domUs run fine for as much as ~7.5 days unpinned

 2. the TSC is supposed to be entirely emulated for the domUs (on my
    hardware), so unless there are (new?) bugs in the Xen TSC emulation
    code that are somehow dependent on ~7.5 days of runtime having
    passed, but which don't affect FreeBSD, and which would somehow be
    avoided if the domU vCPUs were pinned....

However it did have an effect!  Temporary for one domU, ongoing success
for another.

Here's how I've pinned vCPUs for now on this machine (dom0 pinned at
boot with dom0_vcpus_pin=true; the corresponding commands are sketched
at the end of this message):

Name           ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0        0     0    0   -b-   10245.4  0 / all
Domain-0        0     1    1   r--    3181.9  1 / all
Domain-0        0     2    2   -b-    1734.9  2 / all
Domain-0        0     3    3   -b-    1727.0  3 / all
Domain-0        0     4    4   -b-    1990.5  4 / all
Domain-0        0     5    5   -b-    2067.2  5 / all
Domain-0        0     6    6   -b-    2071.7  6 / all
Domain-0        0     7    7   -b-    2164.6  7 / all
fezzik          2     0    6   -b-     919.5  all / all
fezzik          2     1    2   -b-     753.2  all / all
fezzik          2     2    7   -b-     730.0  all / all
fezzik          2     3    5   -b-     692.5  all / all
nb10            3     0    4   -b-     892.4  4 / all
nb10            3     1    5   -b-     661.2  5 / all
nb10            3     2    6   -b-     678.6  6 / all
nb10            3     3    7   -b-     948.1  7 / all
nbtest          9     0    0   -b-    2309.8  0 / all
nbtest          9     1    1   -b-    3166.0  1 / all
nbtest          9     2    2   -b-    2996.5  2 / all
nbtest          9     3    3   -b-    2864.2  3 / all

fezzik is the FreeBSD-14 VM -- it's still keeping perfect time, with
its ntpd running 100% in clock_sync.

nb10 is the stock NetBSD-10.0.  Its ntpd was able to regain clock_sync
after stopping it, running ntpdate, and restarting it, and it has
maintained clock_sync since, now for about 12 hours.

nbtest was able to regain clock_sync after the same stop/ntpdate/restart
sequence, but it lost it again after about an hour.  It regained
clock_sync briefly a couple more times with some massive clock steps,
but has been without it for several hours now.

So pinning domU vCPUs does change something.  Reading the Xen code
again, I now think it only scales the result of RDTSC to the 1 GHz rate,
so it is still using the per-CPU TSC value (which won't be invariant
across CPUs on my hardware).  In that case it doesn't make sense to me
why my nbtest VM didn't fully recover once its vCPUs were pinned.

And it doesn't really explain the ~7.5 days.
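To put a rough number on that scaling point (a back-of-the-envelope
illustration only, assuming a 3.4 GHz host TSC being scaled down to the
nominal 1 GHz guest rate): if a vCPU migrates between two pCPUs whose
TSCs happen to differ by 3,400,000 raw ticks, the scaled value the guest
reads still jumps by 3,400,000 / 3.4 = 1,000,000 "1 GHz" ticks, i.e. a
full millisecond, forwards or backwards.  Scaling changes the units, not
the per-CPU offsets, which is why pinning would be expected to matter
even with the scaling in place.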
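For the archives, here are the concrete commands and settings referred
to above.  First, the "went backwards" event counters can be watched on
a running NetBSD dom0 or domU with the usual event-counter listing (the
grep pattern is just a guess at the counter names, based on the message
quoted above):

    # any non-zero count here means the Xen timecounter went backwards
    vmstat -e | grep -i backwards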
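dom0 pinning is done on the hypervisor command line rather than at
runtime; on NetBSD that means adding dom0_vcpus_pin=true to the
multiboot line in /boot.cfg, something along these lines (the memory and
vCPU numbers are illustrative, not necessarily what this machine uses):

    menu=Xen:load /netbsd-XEN3_DOM0 console=pc;multiboot /xen.gz dom0_mem=4096M dom0_max_vcpus=8 dom0_vcpus_pin=true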
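The tsc_mode=native experiment was just the standard xl.cfg knob in the
domU's config file, i.e.:

    # "default" lets Xen decide (emulated RDTSC here, since the host TSC
    # isn't invariant); "native" hands the guest the raw host TSC
    tsc_mode = "native"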
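The runtime pinning shown in the table was done with xl vcpu-pin; a
sketch, using nb10's assignments (adjust domain names and CPU numbers
to taste):

    # pin each vCPU of the running domU to one physical CPU
    xl vcpu-pin nb10 0 4
    xl vcpu-pin nb10 1 5
    xl vcpu-pin nb10 2 6
    xl vcpu-pin nb10 3 7

    # then check the "Hard" affinity column
    xl vcpu-list nb10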
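And the stop/ntpdate/restart recovery on the NetBSD domUs was nothing
fancier than the following (the server name is a placeholder; use
whatever ntpd is already configured to talk to):

    /etc/rc.d/ntpd stop
    ntpdate -b ntp.example.net      # step the clock once rather than slew
    /etc/rc.d/ntpd start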
--
                                        Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>      Avoncote Farms <woods%avoncote.ca@localhost>