At Tue, 25 Jun 2024 03:14:09 +0000, "Mathew, Cherry G." <c%bow.st@localhost> wrote:
Subject: Re: Xen timecounter issues
>
> >>>>> On Mon, 24 Jun 2024 16:02:04 -0700, "Greg A. Woods" <woods%planix.ca@localhost> said:
>
> > I did observe dom0 having problems without dom0_vcpus_pin=true on my
> > older hardware, but that was only a brief test.  As soon as I pinned the
> > vCPUs, even at runtime, the problems disappeared.
>
> So this "solution" worked only for dom0 ?

The dom0 with un-pinned vCPUs had entirely different symptoms than the
problematic domUs, so "worked only for dom0" isn't quite the correct
characterization.

On dom0 I was seeing "xen raw systime + tsc delta went backwards"
(i.e. ci_xen_raw_systime_backwards_evcnt).  I didn't let it run long
enough in that state for ntpd to get into trouble -- and indeed that
system has kept perfect time again since its vCPUs were (re)pinned.  I
more or less expected this to happen in dom0 on my current hardware
whenever raw/native RDTSC is used.

I did test a domU with tsc_mode=native (and unpinned) for a longer
period and saw similar bad behaviour, with its ntpd holding on for some
time but doing some clock stepping and eventually giving up.

> Did you try pinning them for domU that were drifting ?

I had not tried that.  I didn't see how vCPU pinning for a domU would
make any difference because:

 1. the domUs run fine for as much as ~7.5 days unpinned

 2. the TSC is supposed to be entirely emulated for the domUs (on my
    hardware), so unless there are (new?) bugs in the Xen TSC emulation
    code that are somehow dependent on ~7.5 days of runtime having
    passed, but which don't affect FreeBSD, and which would somehow be
    avoided if the domU vCPUs were pinned....

However it did have an effect!  Temporary for one domU, ongoing success
for another.

Here's how I've pinned vCPUs for now on this machine (dom0 pinned at
boot with dom0_vcpus_pin=true; the corresponding commands are sketched
at the end of this message):

Name           ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0        0     0    0   -b-   10245.4  0 / all
Domain-0        0     1    1   r--    3181.9  1 / all
Domain-0        0     2    2   -b-    1734.9  2 / all
Domain-0        0     3    3   -b-    1727.0  3 / all
Domain-0        0     4    4   -b-    1990.5  4 / all
Domain-0        0     5    5   -b-    2067.2  5 / all
Domain-0        0     6    6   -b-    2071.7  6 / all
Domain-0        0     7    7   -b-    2164.6  7 / all
fezzik          2     0    6   -b-     919.5  all / all
fezzik          2     1    2   -b-     753.2  all / all
fezzik          2     2    7   -b-     730.0  all / all
fezzik          2     3    5   -b-     692.5  all / all
nb10            3     0    4   -b-     892.4  4 / all
nb10            3     1    5   -b-     661.2  5 / all
nb10            3     2    6   -b-     678.6  6 / all
nb10            3     3    7   -b-     948.1  7 / all
nbtest          9     0    0   -b-    2309.8  0 / all
nbtest          9     1    1   -b-    3166.0  1 / all
nbtest          9     2    2   -b-    2996.5  2 / all
nbtest          9     3    3   -b-    2864.2  3 / all

fezzik is the FreeBSD-14 VM -- it's still keeping perfect time, with
its ntpd running 100% in clock_sync.

nb10 is the stock NetBSD-10.0.  Its ntpd was able to regain clock_sync
after stopping it, running ntpdate, and restarting it, and it has
maintained clock_sync since, now for about 12 hours.

nbtest was able to regain clock_sync after the same stop/ntpdate/restart
sequence, but it lost it again after about an hour.  It regained
clock_sync briefly a couple more times with some massive clock steps,
but has been without it for several hours now.

So pinning domU vCPUs does change something.  Reading the Xen code
again, I now think it only scales the result of RDTSC to the 1 GHz rate,
so it is still using the per-CPU TSC value (which won't be invariant
across CPUs on my hardware).  In that case it doesn't make sense to me
why my nbtest VM didn't fully recover once its vCPUs were pinned.

And it doesn't really explain the ~7.5 days.
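To put a rough number on that scaling point (a back-of-the-envelope
illustration only, assuming a 3.4 GHz host TSC being scaled down to the
nominal 1 GHz guest rate): if a vCPU migrates between two pCPUs whose
TSCs happen to differ by 3,400,000 raw ticks, the scaled value the guest
reads still jumps by 3,400,000 / 3.4 = 1,000,000 "1 GHz" ticks, i.e. a
full millisecond, forwards or backwards.  Scaling changes the units, not
the per-CPU offsets, which is why pinning would be expected to matter
even with the scaling in place.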
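For the archives, here are the concrete commands and settings referred
to above.  First, the "went backwards" event counters can be watched on
a running NetBSD dom0 or domU with the usual event-counter listing (the
grep pattern is just a guess at the counter names, based on the message
quoted above):

    # any non-zero count here means the Xen timecounter went backwards
    vmstat -e | grep -i backwards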
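dom0 pinning is done on the hypervisor command line rather than at
runtime; on NetBSD that means adding dom0_vcpus_pin=true to the
multiboot line in /boot.cfg, something along these lines (the memory and
vCPU numbers are illustrative, not necessarily what this machine uses):

    menu=Xen:load /netbsd-XEN3_DOM0 console=pc;multiboot /xen.gz dom0_mem=4096M dom0_max_vcpus=8 dom0_vcpus_pin=true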
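The tsc_mode=native experiment was just the standard xl.cfg knob in the
domU's config file, i.e.:

    # "default" lets Xen decide (emulated RDTSC here, since the host TSC
    # isn't invariant); "native" hands the guest the raw host TSC
    tsc_mode = "native"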
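The runtime pinning shown in the table was done with xl vcpu-pin; a
sketch, using nb10's assignments (adjust domain names and CPU numbers
to taste):

    # pin each vCPU of the running domU to one physical CPU
    xl vcpu-pin nb10 0 4
    xl vcpu-pin nb10 1 5
    xl vcpu-pin nb10 2 6
    xl vcpu-pin nb10 3 7

    # then check the "Hard" affinity column
    xl vcpu-list nb10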
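And the stop/ntpdate/restart recovery on the NetBSD domUs was nothing
fancier than the following (the server name is a placeholder; use
whatever ntpd is already configured to talk to):

    /etc/rc.d/ntpd stop
    ntpdate -b ntp.example.net      # step the clock once rather than slew
    /etc/rc.d/ntpd start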
--
                                        Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>      Avoncote Farms <woods%avoncote.ca@localhost>