At Thu, 01 Feb 2024 19:31:40 -0500, Greg Troxel <gdt%lexort.com@localhost> wrote:
Subject: Re: timekeeping regression?
>
> Greg Troxel <gdt%lexort.com@localhost> writes:
>
> > I update to RC3 (13 days ago says uptime).  I am seeing the dom0 not
> > really converged with ntp, offsets -5000 to -15000 (so 5 to 15s fast).
>
> I got some offlist hints and now I believe:
>
>   this is not a recent regression
>
>   a pv dom0 with multiple cpus keeps time badly
>   a pv or pvshim domU with multiple cpus keeps time badly
>
> I have changed to vcpu=1 for both domUs and added to my dom0 boot line:
>
>   dom0_max_vcpus=1
>
> and now things seem much better.  But, a PV domU netbsd-9 amd64 guest
> with 2 cpus was mostly ok before.

Coincidentally(?) I was just working on the same problem!

I would say it _is_ a "recent" regression, and I think it is due only to
Xen-4.18 (or perhaps a wee bit earlier, but not before Xen-4.13).  It
seems to be somewhat hardware and/or CPU dependent though.

I have a pair of slightly different aged Dell PE2950's, the main
difference being that the one with dom0 time slip has an E5440@2.83GHz
(xenful), running with 4 vCPUs, and it's so bad it can't even measure
jitter on the LAN well:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 xentastic.local 192.75.191.16    3 u    2   64  377    0.018  -656225 28783.5

and the one that keeps time OK has a X5460@3.16GHz (xentastic) with 8 vCPUs:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 local-bcast.loc .BCST.          16 B    -   64    0    0.000   +0.000   0.001
 0.north-america .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 1.north-america .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 2.north-america .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 3.north-america .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 ca.pool.ntp.org .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 0.netbsd.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 1.netbsd.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 2.netbsd.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.001
 3.netbsd.pool.n .POOL.          16 p    -   64    0    0.000   +0.000   0.001
+time13.nrc.ca   132.246.11.231   2 u  635 1024  377   63.361   -1.689   1.769
#time2.chu.nrc.c 209.87.233.52    2 u  355 1024  377  107.746   -3.463  12.118
-ntp1.torix.ca   .PTP0.           1 u  823 1024  377   53.454   -4.014   1.083
-ntp2.torix.ca   .PTP0.           1 u  381 1024  377   55.320   -3.261   1.880
-ntp3.torix.ca   .PTP0.           1 u  395 1024  377   55.234   -5.710   1.380
+ntp1.yycix.ca   10.0.7.1         2 u  703 1024  377   21.605   -1.357   1.116
*ntp2.yycix.ca   10.0.16.1        2 u  515 1024  377   21.660   -1.837   1.102

I tried running NTP with external sources on xenful (the bad one), but
that made no difference.  It just can't keep time.  It doesn't seem to
be random in any way -- the one machine's dom0 can keep time, the
other's cannot no matter what I try (I even tried timed, and it did
better than NTP, but it kept having to step the time).

I also have a Dell PE510 with E5645@2.40GHz, and its dom0 with 4 vCPUs
also keeps very good time, as do its domUs.

They're all running the same Xen kernel (4.18 now) and the same NetBSD
XEN3_DOM0 (a rather dated pre-10.0 -current), and indeed they all ran
roughly the same NetBSD kernel before the Xen upgrade as well.  Prior
to the Xen upgrade they all kept good time (the 510 was running 4.11
and the two PE2950s were running 4.13).

I'm now using the "good" PE2950 as the NTP primary for my local network.

I cannot imagine why having only one vCPU in dom0 helps.  It makes no
sense given what (little) I know about how the Xen kernel timecounters
and dom0 timecounters should interact.

Another interesting point: now that I have a properly working NTP
client configuration using just that one primary local NTP server, all
the domUs -- including those on the older PE2950, and including one
running FreeBSD 14.0 PVH (on the newer PE2950) -- are keeping good time.
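As an aside, ntpq prints delay/offset/jitter in milliseconds, and a
negative offset means the local clock is ahead (fast) of that peer, so
the -656225 offset above works out to the dom0 clock being roughly 11
minutes fast.  A throwaway sketch for pulling numbers out of a
billboard line (the function name and simplified tally handling are my
own, not anything from the ntp tools):

```python
# Parse one line of `ntpq -p` billboard output.  Assumed column layout
# (the standard ntpq billboard):
#   remote refid st t when poll reach delay offset jitter
# delay/offset/jitter are in milliseconds; a negative offset means the
# local clock is ahead of the peer.  Tally handling is deliberately
# simplistic: only the common '*', '+', '-', '#', 'o' codes are
# recognized, since the tally character is glued onto the remote name.

TALLY = '*+-#o'

def parse_peer_line(line):
    fields = line.split()
    remote = fields[0]
    tally = ''
    if remote[0] in TALLY:
        tally, remote = remote[0], remote[1:]
    return {
        'tally': tally,
        'remote': remote,
        'refid': fields[1],
        'stratum': int(fields[2]),
        'offset_s': float(fields[8]) / 1000.0,  # ms -> seconds
        'jitter_ms': float(fields[9]),
    }

# e.g. the runaway peer seen from the "bad" PE2950's dom0:
bad = parse_peer_line(
    "xentastic.local 192.75.191.16 3 u 2 64 377 0.018 -656225 28783.5")
# bad['offset_s'] comes out around -656 s, i.e. ~11 minutes fast.
```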
This is a domU running on xenful, the "bad" PE2950, after almost 4 days
of uptime:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*xentastic.local 206.108.0.132    2 u  808 1024  377    0.125  -27.501  18.602

According to some discussions I found on the internets, the Xen kernel
itself could/should now use the TSC on systems where the TSC is either
effectively or actually stable across cores (and I think it will do so
automatically in certain conditions where the TSC is guaranteed safe),
so I tried appending "clocksource=tsc tsc=stable:socket" to the Xen
command line on the two PE2950s, but that didn't seem to change
anything (i.e. the Xen platform timer stays configured as HPET).

The most recent Xen code says, as I read it, that it should print a
debug message if the configured clocksource isn't valid, but even with
"loglvl=all" no such message appears.  More digging and debugging to
do.  I really would like to get Xen using the TSC as its platform
timer and see if that makes any difference.

The only other VM I can't get to keep good time at all is a VirtualBox
one running on my Mac Pro, and apparently only the VB guest drivers and
the VB guest service daemon can properly fix that.

--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>