CVS commit: pkgsrc/sysutils/xenkernel415
Module Name: pkgsrc
Committed By: bouyer
Date: Fri Jun 24 13:07:52 UTC 2022
Modified Files:
pkgsrc/sysutils/xenkernel415: Makefile distinfo
Added Files:
pkgsrc/sysutils/xenkernel415/patches: patch-XSA397 patch-XSA398
patch-XSA399 patch-XSA400 patch-XSA401 patch-XSA402 patch-XSA404
Log Message:
Apply patches for Xen security advisories XSA-397 through XSA-402, and XSA-404
(XSA-403 has not been released yet).
Bump PKGREVISION
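Bumping PKGREVISION to 2 results in the package version 4.15.2nb2, so the rebuild
can be verified from the installed package name. A minimal check, assuming the
package is already installed and pkg_install's pkg_info is in the PATH:
pkg_info -e xenkernel415
# prints xenkernel415-4.15.2nb2 once the updated package is installed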
To generate a diff of this commit:
cvs rdiff -u -r1.5 -r1.6 pkgsrc/sysutils/xenkernel415/Makefile \
pkgsrc/sysutils/xenkernel415/distinfo
cvs rdiff -u -r0 -r1.1 pkgsrc/sysutils/xenkernel415/patches/patch-XSA397 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA398 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA399 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA400 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA401 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA402 \
pkgsrc/sysutils/xenkernel415/patches/patch-XSA404
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
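To pick up these fixes from a pkgsrc source tree, update the package directory and
rebuild; a minimal sketch, assuming an existing pkgsrc CVS checkout under
/usr/pkgsrc and sufficient privileges to install packages:
cd /usr/pkgsrc/sysutils/xenkernel415
cvs update -dP    # pull in the new Makefile, distinfo and patches/
make update       # rebuild with the XSA patches applied and reinstall as 4.15.2nb2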
Modified files:
Index: pkgsrc/sysutils/xenkernel415/Makefile
diff -u pkgsrc/sysutils/xenkernel415/Makefile:1.5 pkgsrc/sysutils/xenkernel415/Makefile:1.6
--- pkgsrc/sysutils/xenkernel415/Makefile:1.5 Sat Apr 30 00:21:15 2022
+++ pkgsrc/sysutils/xenkernel415/Makefile Fri Jun 24 13:07:52 2022
@@ -1,9 +1,9 @@
-# $NetBSD: Makefile,v 1.5 2022/04/30 00:21:15 khorben Exp $
+# $NetBSD: Makefile,v 1.6 2022/06/24 13:07:52 bouyer Exp $
VERSION= 4.15.2
DISTNAME= xen-${VERSION}
PKGNAME= xenkernel415-${VERSION}
-PKGREVISION= 1
+PKGREVISION= 2
CATEGORIES= sysutils
MASTER_SITES= https://downloads.xenproject.org/release/xen/${VERSION}/
DIST_SUBDIR= xen415
Index: pkgsrc/sysutils/xenkernel415/distinfo
diff -u pkgsrc/sysutils/xenkernel415/distinfo:1.5 pkgsrc/sysutils/xenkernel415/distinfo:1.6
--- pkgsrc/sysutils/xenkernel415/distinfo:1.5 Fri Mar 4 17:54:08 2022
+++ pkgsrc/sysutils/xenkernel415/distinfo Fri Jun 24 13:07:52 2022
@@ -1,9 +1,16 @@
-$NetBSD: distinfo,v 1.5 2022/03/04 17:54:08 bouyer Exp $
+$NetBSD: distinfo,v 1.6 2022/06/24 13:07:52 bouyer Exp $
BLAKE2s (xen415/xen-4.15.2.tar.gz) = f6e3d354a144c9ff49a198ebcafbd5e8a4414690d5672b3e2ed394c461ab8ab0
SHA512 (xen415/xen-4.15.2.tar.gz) = 1cbf988fa8ed38b7ad724978958092ca0e5506e38c709c7d1af196fb8cb8ec0197a79867782761ef230b268624b3d7a0d5d0cd186f37d25f495085c71bf70d54
Size (xen415/xen-4.15.2.tar.gz) = 40773378 bytes
SHA1 (patch-Config.mk) = 9372a09efd05c9fbdbc06f8121e411fcb7c7ba65
+SHA1 (patch-XSA397) = caf9698a8817ae0728da9be6f2018392c9ab6634
+SHA1 (patch-XSA398) = e4fff05675bcf231f9fdf99e9773d1389cd0660c
+SHA1 (patch-XSA399) = c9ab4473654810ca2701dfc38c26e91a0d7f2eb5
+SHA1 (patch-XSA400) = 33d3ae929427ef3e8c74f9e1c36fc1d7e742a8f3
+SHA1 (patch-XSA401) = 8589aa9465c9416b4266beaad37a843de9906add
+SHA1 (patch-XSA402) = 5fe64577fcc249e202591d3a88ab423dbaf0ae42
+SHA1 (patch-XSA404) = ffb441cb248988b679707387e878ad0908082131
SHA1 (patch-xen_Makefile) = 465388d80de414ca3bb84faefa0f52d817e423a6
SHA1 (patch-xen_Rules.mk) = c743dc63f51fc280d529a7d9e08650292c171dac
SHA1 (patch-xen_arch_x86_Kconfig) = df14bfa09b9a0008ca59d53c938d43a644822dd9
Added files:
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA397
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA397:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA397 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,100 @@
+$NetBSD: patch-XSA397,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Roger Pau Monne <roger.pau%citrix.com@localhost>
+Subject: x86/hap: do not switch on log dirty for VRAM tracking
+
+XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
+when using HAP mode, and it can interact badly with other ongoing
+paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
+lock.
+
+This was detected as a result of the following assert triggering when
+doing repeated migrations of a HAP HVM domain with a stubdom:
+
+Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
+----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
+CPU: 34
+RIP: e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
+RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
+[...]
+Xen call trace:
+ [<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
+ [<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
+ [<ffff82d04031577f>] F paging_domctl+0x251/0xd41
+ [<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
+ [<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
+ [<ffff82d0403a729d>] F lstar_enter+0x12d/0x140
+
+Such assert triggered because the stubdom used
+XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
+XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty become enabled while
+retiring the old structures, thus leading to new entries being
+populated in already clear slots.
+
+Fix this by not enabling log dirty for VRAM tracking, similar to what
+is done when using shadow instead of HAP. Call
+p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
+get some hardware assistance if available. As a side effect the memory
+pressure on the p2m pool should go down if only VRAM tracking is
+enabled, as the dirty bitmap is no longer allocated.
+
+Note that paging_log_dirty_range (used to get the dirty bitmap for
+VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
+checking whether each gfn on the range has been switched from
+p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.
+
+This is CVE-2022-26356 / XSA-397.
+
+Signed-off-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
+--- xen/include/asm-x86/paging.h.orig
++++ xen/include/asm-x86/paging.h
+@@ -162,9 +162,6 @@ void paging_log_dirty_range(struct domai
+ unsigned long nr,
+ uint8_t *dirty_bitmap);
+
+-/* enable log dirty */
+-int paging_log_dirty_enable(struct domain *d, bool log_global);
+-
+ /* log dirty initialization */
+ void paging_log_dirty_init(struct domain *d, const struct log_dirty_ops *ops);
+
+--- xen/arch/x86/mm/hap/hap.c.orig
++++ xen/arch/x86/mm/hap/hap.c
+@@ -69,13 +69,6 @@ int hap_track_dirty_vram(struct domain *
+ {
+ unsigned int size = DIV_ROUND_UP(nr_frames, BITS_PER_BYTE);
+
+- if ( !paging_mode_log_dirty(d) )
+- {
+- rc = paging_log_dirty_enable(d, false);
+- if ( rc )
+- goto out;
+- }
+-
+ rc = -ENOMEM;
+ dirty_bitmap = vzalloc(size);
+ if ( !dirty_bitmap )
+@@ -107,6 +100,10 @@ int hap_track_dirty_vram(struct domain *
+
+ paging_unlock(d);
+
++ domain_pause(d);
++ p2m_enable_hardware_log_dirty(d);
++ domain_unpause(d);
++
+ if ( oend > ostart )
+ p2m_change_type_range(d, ostart, oend,
+ p2m_ram_logdirty, p2m_ram_rw);
+--- xen/arch/x86/mm/paging.c.orig
++++ xen/arch/x86/mm/paging.c
+@@ -211,7 +211,7 @@ static int paging_free_log_dirty_bitmap(
+ return rc;
+ }
+
+-int paging_log_dirty_enable(struct domain *d, bool log_global)
++static int paging_log_dirty_enable(struct domain *d, bool log_global)
+ {
+ int ret;
+
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA398
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA398:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA398 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,120 @@
+$NetBSD: patch-XSA398,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From 1b50f41b3bd800eb72064063da0c64b86d629f3a Mon Sep 17 00:00:00 2001
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Date: Mon, 7 Mar 2022 16:35:52 +0000
+Subject: x86/spec-ctrl: Cease using thunk=lfence on AMD
+
+AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
+considered safe. AMD are recommending using retpoline everywhere.
+
+Retpoline is incompatible with CET. All CET-capable hardware has efficient
+IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
+STIBP for consistency sake).
+
+This is a logical change on AMD, but not on Intel as the default calculations
+would end up with these settings anyway. Leave behind a message if IBRS is
+found to be missing.
+
+Also update the default heuristics to never select THUNK_LFENCE. This causes
+AMD CPUs to change their default to retpoline.
+
+Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
+STIBP now that we set it for consistency sake.
+
+This is part of XSA-398 / CVE-2021-26401.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)
+
+diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
+index 443802b3d2e5..2392537954c8 100644
+--- docs/misc/xen-command-line.pandoc.orig
++++ docs/misc/xen-command-line.pandoc
+@@ -2205,9 +2205,9 @@ to use.
+
+ If Xen was compiled with INDIRECT_THUNK support, `bti-thunk=` can be used to
+ select which of the thunks gets patched into the `__x86_indirect_thunk_%reg`
+-locations. The default thunk is `retpoline` (generally preferred for Intel
+-hardware), with the alternatives being `jmp` (a `jmp *%reg` gadget, minimal
+-overhead), and `lfence` (an `lfence; jmp *%reg` gadget, preferred for AMD).
++locations. The default thunk is `retpoline` (generally preferred), with the
++alternatives being `jmp` (a `jmp *%reg` gadget, minimal overhead), and
++`lfence` (an `lfence; jmp *%reg` gadget).
+
+ On hardware supporting IBRS (Indirect Branch Restricted Speculation), the
+ `ibrs=` option can be used to force or prevent Xen using the feature itself.
+diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
+index 9301d95bd705..7ded6ecba197 100644
+--- xen/arch/x86/spec_ctrl.c.orig
++++ xen/arch/x86/spec_ctrl.c
+@@ -367,14 +367,19 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
+ "\n");
+
+ /* Settings for Xen's protection, irrespective of guests. */
+- printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s%s%s\n",
++ printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s%s, Other:%s%s%s%s%s\n",
+ thunk == THUNK_NONE ? "N/A" :
+ thunk == THUNK_RETPOLINE ? "RETPOLINE" :
+ thunk == THUNK_LFENCE ? "LFENCE" :
+ thunk == THUNK_JMP ? "JMP" : "?",
+- !boot_cpu_has(X86_FEATURE_IBRSB) ? "No" :
++ (!boot_cpu_has(X86_FEATURE_IBRSB) &&
++ !boot_cpu_has(X86_FEATURE_IBRS)) ? "No" :
+ (default_xen_spec_ctrl & SPEC_CTRL_IBRS) ? "IBRS+" : "IBRS-",
+- !boot_cpu_has(X86_FEATURE_SSBD) ? "" :
++ (!boot_cpu_has(X86_FEATURE_STIBP) &&
++ !boot_cpu_has(X86_FEATURE_AMD_STIBP)) ? "" :
++ (default_xen_spec_ctrl & SPEC_CTRL_STIBP) ? " STIBP+" : " STIBP-",
++ (!boot_cpu_has(X86_FEATURE_SSBD) &&
++ !boot_cpu_has(X86_FEATURE_AMD_SSBD)) ? "" :
+ (default_xen_spec_ctrl & SPEC_CTRL_SSBD) ? " SSBD+" : " SSBD-",
+ !(caps & ARCH_CAPS_TSX_CTRL) ? "" :
+ (opt_tsx & 1) ? " TSX+" : " TSX-",
+@@ -916,10 +921,23 @@ void __init init_speculation_mitigations(void)
+ /*
+ * First, disable the use of retpolines if Xen is using shadow stacks, as
+ * they are incompatible.
++ *
++ * In the absence of retpolines, IBRS needs to be used for speculative
++ * safety. All CET-capable hardware has efficient IBRS.
+ */
+- if ( cpu_has_xen_shstk &&
+- (opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE) )
+- thunk = THUNK_JMP;
++ if ( cpu_has_xen_shstk )
++ {
++ if ( !boot_cpu_has(X86_FEATURE_IBRSB) )
++ printk(XENLOG_WARNING "?!? CET active, but no MSR_SPEC_CTRL?\n");
++ else if ( opt_ibrs == -1 )
++ {
++ opt_ibrs = ibrs = true;
++ default_xen_spec_ctrl |= SPEC_CTRL_IBRS | SPEC_CTRL_STIBP;
++ }
++
++ if ( opt_thunk == THUNK_DEFAULT || opt_thunk == THUNK_RETPOLINE )
++ thunk = THUNK_JMP;
++ }
+
+ /*
+ * Has the user specified any custom BTI mitigations? If so, follow their
+@@ -951,16 +951,10 @@
+ if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) )
+ {
+ /*
+- * AMD's recommended mitigation is to set lfence as being dispatch
+- * serialising, and to use IND_THUNK_LFENCE.
+- */
+- if ( cpu_has_lfence_dispatch )
+- thunk = THUNK_LFENCE;
+- /*
+- * On Intel hardware, we'd like to use retpoline in preference to
++ * On all hardware, we'd like to use retpoline in preference to
+ * IBRS, but only if it is safe on this hardware.
+ */
+- else if ( retpoline_safe(caps) )
++ if ( retpoline_safe(caps) )
+ thunk = THUNK_RETPOLINE;
+ else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
+ ibrs = true;
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA399
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA399:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA399 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,47 @@
+$NetBSD: patch-XSA399,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: correct ordering of operations in cleanup_domid_map()
+
+The function may be called without any locks held (leaving aside the
+domctl one, which we surely don't want to depend on here), so needs to
+play safe wrt other accesses to domid_map[] and domid_bitmap[]. This is
+to avoid context_set_domain_id()'s writing of domid_map[] to be reset to
+zero right away in the case of it racing the freeing of a DID.
+
+For the interaction with context_set_domain_id() and ->domid_map[] reads
+see the code comment.
+
+{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
+domain cleanup only (and pcidevs_lock is also held around
+context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
+tuple cannot occur.
+
+domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
+result used only to control flushing, and hence a stale result would
+only lead to a stray extra flush.
+
+This is CVE-2022-26357 / XSA-399.
+
+Fixes: b9c20c78789f ("VT-d: per-iommu domain-id")
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -152,8 +152,14 @@ static void cleanup_domid_map(struct dom
+
+ if ( iommu_domid >= 0 )
+ {
++ /*
++ * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
++ * context_set_domain_id(), setting the slot to DOMID_INVALID for
++ * ->domid_map[] reads to produce a suitable value while the bit is
++ * still set.
++ */
++ iommu->domid_map[iommu_domid] = DOMID_INVALID;
+ clear_bit(iommu_domid, iommu->domid_bitmap);
+- iommu->domid_map[iommu_domid] = 0;
+ }
+ }
+
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA400
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA400:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA400 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,3142 @@
+$NetBSD: patch-XSA400,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: fix (de)assign ordering when RMRRs are in use
+
+In the event that the RMRR mappings are essential for device operation,
+they should be established before updating the device's context entry,
+while they should be torn down only after the device's context entry was
+successfully updated.
+
+Also adjust a related log message.
+
+This is CVE-2022-26358 / part of XSA-400.
+
+Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -2392,6 +2392,10 @@ static int reassign_device_ownership(
+ {
+ int ret;
+
++ ret = domain_context_unmap(source, devfn, pdev);
++ if ( ret )
++ return ret;
++
+ /*
+ * Devices assigned to untrusted domains (here assumed to be any domU)
+ * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
+@@ -2428,10 +2432,6 @@ static int reassign_device_ownership(
+ }
+ }
+
+- ret = domain_context_unmap(source, devfn, pdev);
+- if ( ret )
+- return ret;
+-
+ if ( devfn == pdev->devfn && pdev->domain != dom_io )
+ {
+ list_move(&pdev->domain_list, &dom_io->pdev_list);
+@@ -2507,9 +2507,8 @@ static int intel_iommu_assign_device(
+ }
+ }
+
+- ret = reassign_device_ownership(s, d, devfn, pdev);
+- if ( ret || d == dom_io )
+- return ret;
++ if ( d == dom_io )
++ return reassign_device_ownership(s, d, devfn, pdev);
+
+ /* Setup rmrr identity mapping */
+ for_each_rmrr_device( rmrr, bdf, i )
+@@ -2522,20 +2521,37 @@ static int intel_iommu_assign_device(
+ rmrr->end_address, flag);
+ if ( ret )
+ {
+- int rc;
+-
+- rc = reassign_device_ownership(d, s, devfn, pdev);
+ printk(XENLOG_G_ERR VTDPREFIX
+- " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
+- rmrr->base_address, rmrr->end_address,
+- d->domain_id, ret);
+- if ( rc )
+- {
+- printk(XENLOG_ERR VTDPREFIX
+- " failed to reclaim %pp from %pd (%d)\n",
+- &PCI_SBDF3(seg, bus, devfn), d, rc);
+- domain_crash(d);
+- }
++ "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
++ d, rmrr->base_address, rmrr->end_address, ret);
++ break;
++ }
++ }
++ }
++
++ if ( !ret )
++ ret = reassign_device_ownership(s, d, devfn, pdev);
++
++ /* See reassign_device_ownership() for the hwdom aspect. */
++ if ( !ret || is_hardware_domain(d) )
++ return ret;
++
++ for_each_rmrr_device( rmrr, bdf, i )
++ {
++ if ( rmrr->segment == seg &&
++ PCI_BUS(bdf) == bus &&
++ PCI_DEVFN2(bdf) == devfn )
++ {
++ int rc = iommu_identity_mapping(d, p2m_access_x,
++ rmrr->base_address,
++ rmrr->end_address, 0);
++
++ if ( rc && rc != -ENOENT )
++ {
++ printk(XENLOG_ERR VTDPREFIX
++ "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
++ d, rmrr->base_address, rmrr->end_address, rc);
++ domain_crash(d);
+ break;
+ }
+ }
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: fix add/remove ordering when RMRRs are in use
+
+In the event that the RMRR mappings are essential for device operation,
+they should be established before updating the device's context entry,
+while they should be torn down only after the device's context entry was
+successfully cleared.
+
+Also switch to %pd in related log messages.
+
+Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
+Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -1981,14 +1981,6 @@ static int intel_iommu_add_device(u8 dev
+ if ( !pdev->domain )
+ return -EINVAL;
+
+- ret = domain_context_mapping(pdev->domain, devfn, pdev);
+- if ( ret )
+- {
+- dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
+- pdev->domain->domain_id);
+- return ret;
+- }
+-
+ for_each_rmrr_device ( rmrr, bdf, i )
+ {
+ if ( rmrr->segment == pdev->seg &&
+@@ -2005,12 +1997,17 @@ static int intel_iommu_add_device(u8 dev
+ rmrr->base_address, rmrr->end_address,
+ 0);
+ if ( ret )
+- dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
+- pdev->domain->domain_id);
++ dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
++ pdev->domain);
+ }
+ }
+
+- return 0;
++ ret = domain_context_mapping(pdev->domain, devfn, pdev);
++ if ( ret )
++ dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
++ pdev->domain);
++
++ return ret;
+ }
+
+ static int intel_iommu_enable_device(struct pci_dev *pdev)
+@@ -2032,11 +2029,15 @@ static int intel_iommu_remove_device(u8
+ {
+ struct acpi_rmrr_unit *rmrr;
+ u16 bdf;
+- int i;
++ int ret, i;
+
+ if ( !pdev->domain )
+ return -EINVAL;
+
++ ret = domain_context_unmap(pdev->domain, devfn, pdev);
++ if ( ret )
++ return ret;
++
+ for_each_rmrr_device ( rmrr, bdf, i )
+ {
+ if ( rmrr->segment != pdev->seg ||
+@@ -2052,7 +2053,7 @@ static int intel_iommu_remove_device(u8
+ rmrr->end_address, 0);
+ }
+
+- return domain_context_unmap(pdev->domain, devfn, pdev);
++ return 0;
+ }
+
+ static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter
+
+This is to make more obvious that nothing outside of domain_iommu(d)
+actually changes or is otherwise needed by the function.
+
+No functional change intended.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/include/asm-x86/iommu.h.orig
++++ xen/include/asm-x86/iommu.h
+@@ -143,7 +143,8 @@ int pi_update_irte(const struct pi_desc
+ })
+
+ int __must_check iommu_free_pgtables(struct domain *d);
+-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
++struct domain_iommu;
++struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
+
+ #endif /* !__ARCH_X86_IOMMU_H__ */
+ /*
+--- xen/drivers/passthrough/amd/iommu_map.c.orig
++++ xen/drivers/passthrough/amd/iommu_map.c
+@@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom
+ unsigned long next_table_mfn;
+ unsigned int level;
+ struct page_info *table;
+- const struct domain_iommu *hd = dom_iommu(d);
++ struct domain_iommu *hd = dom_iommu(d);
+
+ table = hd->arch.amd.root_table;
+ level = hd->arch.amd.paging_mode;
+@@ -220,7 +220,7 @@ static int iommu_pde_from_dfn(struct dom
+ mfn = next_table_mfn;
+
+ /* allocate lower level page table */
+- table = iommu_alloc_pgtable(d);
++ table = iommu_alloc_pgtable(hd);
+ if ( table == NULL )
+ {
+ AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
+@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
+
+ if ( next_table_mfn == 0 )
+ {
+- table = iommu_alloc_pgtable(d);
++ table = iommu_alloc_pgtable(hd);
+ if ( table == NULL )
+ {
+ AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
+@@ -483,7 +483,7 @@ int __init amd_iommu_quarantine_init(str
+
+ spin_lock(&hd->arch.mapping_lock);
+
+- hd->arch.amd.root_table = iommu_alloc_pgtable(d);
++ hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
+ if ( !hd->arch.amd.root_table )
+ goto out;
+
+@@ -498,7 +498,7 @@ int __init amd_iommu_quarantine_init(str
+ * page table pages, and the resulting allocations are always
+ * zeroed.
+ */
+- pg = iommu_alloc_pgtable(d);
++ pg = iommu_alloc_pgtable(hd);
+ if ( !pg )
+ break;
+
+--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
++++ xen/drivers/passthrough/amd/pci_amd_iommu.c
+@@ -208,7 +208,7 @@ int amd_iommu_alloc_root(struct domain *
+
+ if ( unlikely(!hd->arch.amd.root_table) )
+ {
+- hd->arch.amd.root_table = iommu_alloc_pgtable(d);
++ hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
+ if ( !hd->arch.amd.root_table )
+ return -ENOMEM;
+ }
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -327,7 +327,7 @@ static u64 addr_to_dma_page_maddr(struct
+ {
+ struct page_info *pg;
+
+- if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
++ if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) )
+ goto out;
+
+ hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
+@@ -347,7 +347,7 @@ static u64 addr_to_dma_page_maddr(struct
+ if ( !alloc )
+ break;
+
+- pg = iommu_alloc_pgtable(domain);
++ pg = iommu_alloc_pgtable(hd);
+ if ( !pg )
+ break;
+
+@@ -2761,7 +2761,7 @@ static int __init intel_iommu_quarantine
+ goto out;
+ }
+
+- pg = iommu_alloc_pgtable(d);
++ pg = iommu_alloc_pgtable(hd);
+
+ rc = -ENOMEM;
+ if ( !pg )
+@@ -2780,7 +2780,7 @@ static int __init intel_iommu_quarantine
+ * page table pages, and the resulting allocations are always
+ * zeroed.
+ */
+- pg = iommu_alloc_pgtable(d);
++ pg = iommu_alloc_pgtable(hd);
+
+ if ( !pg )
+ goto out;
+--- xen/drivers/passthrough/x86/iommu.c.orig
++++ xen/drivers/passthrough/x86/iommu.c
+@@ -415,9 +415,8 @@ int iommu_free_pgtables(struct domain *d
+ return 0;
+ }
+
+-struct page_info *iommu_alloc_pgtable(struct domain *d)
++struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
+ {
+- struct domain_iommu *hd = dom_iommu(d);
+ unsigned int memflags = 0;
+ struct page_info *pg;
+ void *p;
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: drop ownership checking from domain_context_mapping_one()
+
+Despite putting in quite a bit of effort it was not possible to
+establish why exactly this code exists (beyond possibly sanity
+checking). Instead of a subsequent change further complicating this
+logic, simply get rid of it.
+
+Take the opportunity and move the respective unmap_vtd_domain_page() out
+of the locked region.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -121,28 +121,6 @@ static int context_set_domain_id(struct
+ return 0;
+ }
+
+-static int context_get_domain_id(struct context_entry *context,
+- struct vtd_iommu *iommu)
+-{
+- unsigned long dom_index, nr_dom;
+- int domid = -1;
+-
+- if (iommu && context)
+- {
+- nr_dom = cap_ndoms(iommu->cap);
+-
+- dom_index = context_domain_id(*context);
+-
+- if ( dom_index < nr_dom && iommu->domid_map )
+- domid = iommu->domid_map[dom_index];
+- else
+- dprintk(XENLOG_DEBUG VTDPREFIX,
+- "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
+- dom_index, nr_dom);
+- }
+- return domid;
+-}
+-
+ static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+ {
+ int iommu_domid = domain_iommu_domid(domain, iommu);
+@@ -1404,44 +1382,9 @@ int domain_context_mapping_one(
+
+ if ( context_present(*context) )
+ {
+- int res = 0;
+-
+- /* Try to get domain ownership from device structure. If that's
+- * not available, try to read it from the context itself. */
+- if ( pdev )
+- {
+- if ( pdev->domain != domain )
+- {
+- printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd",
+- domain, &PCI_SBDF3(seg, bus, devfn),
+- pdev->domain);
+- res = -EINVAL;
+- }
+- }
+- else
+- {
+- int cdomain;
+- cdomain = context_get_domain_id(context, iommu);
+-
+- if ( cdomain < 0 )
+- {
+- printk(XENLOG_G_WARNING VTDPREFIX
+- "%pd: %pp mapped, but can't find owner\n",
+- domain, &PCI_SBDF3(seg, bus, devfn));
+- res = -EINVAL;
+- }
+- else if ( cdomain != domain->domain_id )
+- {
+- printk(XENLOG_G_INFO VTDPREFIX
+- "%pd: %pp already mapped to d%d",
+- domain, &PCI_SBDF3(seg, bus, devfn), cdomain);
+- res = -EINVAL;
+- }
+- }
+-
+- unmap_vtd_domain_page(context_entries);
+ spin_unlock(&iommu->lock);
+- return res;
++ unmap_vtd_domain_page(context_entries);
++ return 0;
+ }
+
+ if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: re-assign devices directly
+
+Devices with RMRRs, due to it being unspecified how/when the specified
+memory regions may get accessed, may not be left disconnected from their
+respective mappings (as long as it's not certain that the device has
+been fully quiesced). Hence rather than unmapping the old context and
+then mapping the new one, re-assignment needs to be done in a single
+step.
+
+This is CVE-2022-26359 / part of XSA-400.
+
+Reported-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
+Similarly quarantining scratch-page mode relies on page tables to be
+continuously wired up.
+
+To avoid complicating things more than necessary, treat all devices
+mostly equally, i.e. regardless of their association with any RMRRs. The
+main difference is when it comes to updating context entries, which need
+to be atomic when there are RMRRs. Yet atomicity can only be achieved
+with CMPXCHG16B, availability of which we can't take for given.
+
+The seemingly complicated choice of non-negative return values for
+domain_context_mapping_one() is to limit code churn: This way callers
+passing NULL for pdev don't need fiddling with.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
+--- xen/drivers/passthrough/vtd/extern.h.orig
++++ xen/drivers/passthrough/vtd/extern.h
+@@ -84,7 +84,8 @@ void free_pgtable_maddr(u64 maddr);
+ void *map_vtd_domain_page(u64 maddr);
+ void unmap_vtd_domain_page(const void *va);
+ int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
+- u8 bus, u8 devfn, const struct pci_dev *);
++ uint8_t bus, uint8_t devfn,
++ const struct pci_dev *pdev, unsigned int mode);
+ int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
+ u8 bus, u8 devfn);
+ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
+@@ -103,8 +104,8 @@ int is_igd_vt_enabled_quirk(void);
+ void platform_quirks_init(void);
+ void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
+ void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
+-int __must_check me_wifi_quirk(struct domain *domain,
+- u8 bus, u8 devfn, int map);
++int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
++ uint8_t devfn, unsigned int mode);
+ void pci_vtd_quirk(const struct pci_dev *);
+ void quirk_iommu_caps(struct vtd_iommu *iommu);
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -117,6 +117,7 @@ static int context_set_domain_id(struct
+ }
+
+ set_bit(i, iommu->domid_bitmap);
++ context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+ context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+ return 0;
+ }
+@@ -1362,15 +1363,27 @@ static void __hwdom_init intel_iommu_hwd
+ }
+ }
+
++/*
++ * This function returns
++ * - a negative errno value upon error,
++ * - zero upon success when previously the entry was non-present, or this isn't
++ * the "main" request for a device (pdev == NULL), or for no-op quarantining
++ * assignments,
++ * - positive (one) upon success when previously the entry was present and this
++ * is the "main" request for a device (pdev != NULL).
++ */
+ int domain_context_mapping_one(
+ struct domain *domain,
+ struct vtd_iommu *iommu,
+- u8 bus, u8 devfn, const struct pci_dev *pdev)
++ uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
++ unsigned int mode)
+ {
+ struct domain_iommu *hd = dom_iommu(domain);
+- struct context_entry *context, *context_entries;
++ struct context_entry *context, *context_entries, lctxt;
++ __uint128_t old;
+ u64 maddr, pgd_maddr;
+- u16 seg = iommu->drhd->segment;
++ uint16_t seg = iommu->drhd->segment, prev_did = 0;
++ struct domain *prev_dom = NULL;
+ int rc, ret;
+ bool_t flush_dev_iotlb;
+
+@@ -1379,17 +1392,32 @@ int domain_context_mapping_one(
+ maddr = bus_to_context_maddr(iommu, bus);
+ context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
+ context = &context_entries[devfn];
++ old = (lctxt = *context).full;
+
+- if ( context_present(*context) )
++ if ( context_present(lctxt) )
+ {
+- spin_unlock(&iommu->lock);
+- unmap_vtd_domain_page(context_entries);
+- return 0;
++ domid_t domid;
++
++ prev_did = context_domain_id(lctxt);
++ domid = iommu->domid_map[prev_did];
++ if ( domid < DOMID_FIRST_RESERVED )
++ prev_dom = rcu_lock_domain_by_id(domid);
++ else if ( domid == DOMID_IO )
++ prev_dom = rcu_lock_domain(dom_io);
++ if ( !prev_dom )
++ {
++ spin_unlock(&iommu->lock);
++ unmap_vtd_domain_page(context_entries);
++ dprintk(XENLOG_DEBUG VTDPREFIX,
++ "no domain for did %u (nr_dom %u)\n",
++ prev_did, cap_ndoms(iommu->cap));
++ return -ESRCH;
++ }
+ }
+
+ if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
+ {
+- context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
++ context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
+ }
+ else
+ {
+@@ -1401,36 +1429,107 @@ int domain_context_mapping_one(
+ spin_unlock(&hd->arch.mapping_lock);
+ spin_unlock(&iommu->lock);
+ unmap_vtd_domain_page(context_entries);
++ if ( prev_dom )
++ rcu_unlock_domain(prev_dom);
+ return -ENOMEM;
+ }
+
+- context_set_address_root(*context, pgd_maddr);
++ context_set_address_root(lctxt, pgd_maddr);
+ if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
+- context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
++ context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
+ else
+- context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
++ context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
+
+ spin_unlock(&hd->arch.mapping_lock);
+ }
+
+- if ( context_set_domain_id(context, domain, iommu) )
++ if ( context_set_domain_id(&lctxt, domain, iommu) )
+ {
++ unlock:
+ spin_unlock(&iommu->lock);
+ unmap_vtd_domain_page(context_entries);
++ if ( prev_dom )
++ rcu_unlock_domain(prev_dom);
+ return -EFAULT;
+ }
+
+- context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels));
+- context_set_fault_enable(*context);
+- context_set_present(*context);
++ if ( !prev_dom )
++ {
++ context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
++ context_set_fault_enable(lctxt);
++ context_set_present(lctxt);
++ }
++ else if ( prev_dom == domain )
++ {
++ ASSERT(lctxt.full == context->full);
++ rc = !!pdev;
++ goto unlock;
++ }
++ else
++ {
++ ASSERT(context_address_width(lctxt) ==
++ level_to_agaw(iommu->nr_pt_levels));
++ ASSERT(!context_fault_disable(lctxt));
++ }
++
++ if ( cpu_has_cx16 )
++ {
++ __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
++
++ /*
++ * Hardware does not update the context entry behind our backs,
++ * so the return value should match "old".
++ */
++ if ( res != old )
++ {
++ if ( pdev )
++ check_cleanup_domid_map(domain, pdev, iommu);
++ printk(XENLOG_ERR
++ "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
++ &PCI_SBDF3(pdev->seg, pdev->bus, devfn),
++ (uint64_t)(res >> 64), (uint64_t)res,
++ (uint64_t)(old >> 64), (uint64_t)old);
++ rc = -EILSEQ;
++ goto unlock;
++ }
++ }
++ else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
++ {
++ context_clear_present(*context);
++ iommu_sync_cache(context, sizeof(*context));
++
++ write_atomic(&context->hi, lctxt.hi);
++ /* No barrier should be needed between these two. */
++ write_atomic(&context->lo, lctxt.lo);
++ }
++ else /* Best effort, updating DID last. */
++ {
++ /*
++ * By non-atomically updating the context entry's DID field last,
++ * during a short window in time TLB entries with the old domain ID
++ * but the new page tables may be inserted. This could affect I/O
++ * of other devices using this same (old) domain ID. Such updating
++ * therefore is not a problem if this was the only device associated
++ * with the old domain ID. Diverting I/O of any of a dying domain's
++ * devices to the quarantine page tables is intended anyway.
++ */
++ if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
++ printk(XENLOG_WARNING VTDPREFIX
++ " %pp: reassignment may cause %pd data corruption\n",
++ &PCI_SBDF3(seg, bus, devfn), prev_dom);
++
++ write_atomic(&context->lo, lctxt.lo);
++ /* No barrier should be needed between these two. */
++ write_atomic(&context->hi, lctxt.hi);
++ }
++
+ iommu_sync_cache(context, sizeof(struct context_entry));
+ spin_unlock(&iommu->lock);
+
+- /* Context entry was previously non-present (with domid 0). */
+- rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
+- DMA_CCMD_MASK_NOBIT, 1);
++ rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
++ DMA_CCMD_MASK_NOBIT, !prev_dom);
+ flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
+- ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
++ ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
+
+ /*
+ * The current logic for returns:
+@@ -1451,17 +1550,26 @@ int domain_context_mapping_one(
+ unmap_vtd_domain_page(context_entries);
+
+ if ( !seg && !rc )
+- rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
++ rc = me_wifi_quirk(domain, bus, devfn, mode);
+
+ if ( rc )
+ {
+- ret = domain_context_unmap_one(domain, iommu, bus, devfn);
++ if ( !prev_dom )
++ ret = domain_context_unmap_one(domain, iommu, bus, devfn);
++ else if ( prev_dom != domain ) /* Avoid infinite recursion. */
++ ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
++ mode & MAP_WITH_RMRR) < 0;
++ else
++ ret = 1;
+
+ if ( !ret && pdev && pdev->devfn == devfn )
+ check_cleanup_domid_map(domain, pdev, iommu);
+ }
+
+- return rc;
++ if ( prev_dom )
++ rcu_unlock_domain(prev_dom);
++
++ return rc ?: pdev && prev_dom;
+ }
+
+ static int domain_context_unmap(struct domain *d, uint8_t devfn,
+@@ -1471,8 +1579,10 @@ static int domain_context_mapping(struct
+ struct pci_dev *pdev)
+ {
+ struct acpi_drhd_unit *drhd;
++ const struct acpi_rmrr_unit *rmrr;
+ int ret = 0;
+- uint16_t seg = pdev->seg;
++ unsigned int i, mode = 0;
++ uint16_t seg = pdev->seg, bdf;
+ uint8_t bus = pdev->bus, secbus;
+
+ drhd = acpi_find_matched_drhd_unit(pdev);
+@@ -1492,8 +1602,29 @@ static int domain_context_mapping(struct
+
+ ASSERT(pcidevs_locked());
+
++ for_each_rmrr_device( rmrr, bdf, i )
++ {
++ if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
++ continue;
++
++ mode |= MAP_WITH_RMRR;
++ break;
++ }
++
++ if ( domain != pdev->domain )
++ {
++ if ( pdev->domain->is_dying )
++ mode |= MAP_OWNER_DYING;
++ else if ( drhd &&
++ !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
++ !pdev->phantom_stride )
++ mode |= MAP_SINGLE_DEVICE;
++ }
++
+ switch ( pdev->type )
+ {
++ bool prev_present;
++
+ case DEV_TYPE_PCI_HOST_BRIDGE:
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
+@@ -1512,7 +1643,9 @@ static int domain_context_mapping(struct
+ printk(VTDPREFIX "%pd:PCIe: map %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev);
++ pdev, mode);
++ if ( ret > 0 )
++ ret = 0;
+ if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+ enable_ats_device(pdev, &drhd->iommu->ats_devices);
+
+@@ -1524,9 +1657,10 @@ static int domain_context_mapping(struct
+ domain, &PCI_SBDF3(seg, bus, devfn));
+
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev);
+- if ( ret )
++ pdev, mode);
++ if ( ret < 0 )
+ break;
++ prev_present = ret;
+
+ if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
+ {
+@@ -1534,6 +1668,15 @@ static int domain_context_mapping(struct
+ break;
+ ret = -ENXIO;
+ }
++ /*
++ * Strictly speaking if the device is the only one behind this bridge
++ * and the only one with this (secbus,0,0) tuple, it could be allowed
++ * to be re-assigned regardless of RMRR presence. But let's deal with
++ * that case only if it is actually found in the wild.
++ */
++ else if ( prev_present && (mode & MAP_WITH_RMRR) &&
++ domain != pdev->domain )
++ ret = -EOPNOTSUPP;
+
+ /*
+ * Mapping a bridge should, if anything, pass the struct pci_dev of
+@@ -1542,7 +1685,7 @@ static int domain_context_mapping(struct
+ */
+ if ( ret >= 0 )
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- NULL);
++ NULL, mode);
+
+ /*
+ * Devices behind PCIe-to-PCI/PCIx bridge may generate different
+@@ -1557,10 +1700,15 @@ static int domain_context_mapping(struct
+ if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
+ (secbus != pdev->bus || pdev->devfn != 0) )
+ ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
+- NULL);
++ NULL, mode);
+
+ if ( ret )
+- domain_context_unmap(domain, devfn, pdev);
++ {
++ if ( !prev_present )
++ domain_context_unmap(domain, devfn, pdev);
++ else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
++ domain_context_mapping(pdev->domain, devfn, pdev);
++ }
+
+ break;
+
+@@ -2336,9 +2484,8 @@ static int reassign_device_ownership(
+ {
+ int ret;
+
+- ret = domain_context_unmap(source, devfn, pdev);
+- if ( ret )
+- return ret;
++ if ( !has_arch_pdevs(target) )
++ vmx_pi_hooks_assign(target);
+
+ /*
+ * Devices assigned to untrusted domains (here assumed to be any domU)
+@@ -2348,6 +2495,31 @@ static int reassign_device_ownership(
+ if ( (target != hardware_domain) && !iommu_intremap )
+ untrusted_msi = true;
+
++ ret = domain_context_mapping(target, devfn, pdev);
++ if ( ret )
++ {
++ if ( !has_arch_pdevs(target) )
++ vmx_pi_hooks_deassign(target);
++ return ret;
++ }
++
++ if ( pdev->devfn == devfn )
++ {
++ const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
++
++ if ( drhd )
++ check_cleanup_domid_map(source, pdev, drhd->iommu);
++ }
++
++ if ( devfn == pdev->devfn && pdev->domain != target )
++ {
++ list_move(&pdev->domain_list, &target->pdev_list);
++ pdev->domain = target;
++ }
++
++ if ( !has_arch_pdevs(source) )
++ vmx_pi_hooks_deassign(source);
++
+ /*
+ * If the device belongs to the hardware domain, and it has RMRR, don't
+ * remove it from the hardware domain, because BIOS may use RMRR at
+@@ -2376,34 +2548,7 @@ static int reassign_device_ownership(
+ }
+ }
+
+- if ( devfn == pdev->devfn && pdev->domain != dom_io )
+- {
+- list_move(&pdev->domain_list, &dom_io->pdev_list);
+- pdev->domain = dom_io;
+- }
+-
+- if ( !has_arch_pdevs(source) )
+- vmx_pi_hooks_deassign(source);
+-
+- if ( !has_arch_pdevs(target) )
+- vmx_pi_hooks_assign(target);
+-
+- ret = domain_context_mapping(target, devfn, pdev);
+- if ( ret )
+- {
+- if ( !has_arch_pdevs(target) )
+- vmx_pi_hooks_deassign(target);
+-
+- return ret;
+- }
+-
+- if ( devfn == pdev->devfn && pdev->domain != target )
+- {
+- list_move(&pdev->domain_list, &target->pdev_list);
+- pdev->domain = target;
+- }
+-
+- return ret;
++ return 0;
+ }
+
+ static int intel_iommu_assign_device(
+--- xen/drivers/passthrough/vtd/iommu.h.orig
++++ xen/drivers/passthrough/vtd/iommu.h
+@@ -202,8 +202,12 @@ struct root_entry {
+ do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
+
+ struct context_entry {
+- u64 lo;
+- u64 hi;
++ union {
++ struct {
++ uint64_t lo, hi;
++ };
++ __uint128_t full;
++ };
+ };
+ #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
+ #define context_present(c) ((c).lo & 1)
+--- xen/drivers/passthrough/vtd/quirks.c.orig
++++ xen/drivers/passthrough/vtd/quirks.c
+@@ -344,7 +344,8 @@ void __init platform_quirks_init(void)
+ */
+
+ static int __must_check map_me_phantom_function(struct domain *domain,
+- u32 dev, int map)
++ unsigned int dev,
++ unsigned int mode)
+ {
+ struct acpi_drhd_unit *drhd;
+ struct pci_dev *pdev;
+@@ -355,9 +356,9 @@ static int __must_check map_me_phantom_f
+ drhd = acpi_find_matched_drhd_unit(pdev);
+
+ /* map or unmap ME phantom function */
+- if ( map )
++ if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
+ rc = domain_context_mapping_one(domain, drhd->iommu, 0,
+- PCI_DEVFN(dev, 7), NULL);
++ PCI_DEVFN(dev, 7), NULL, mode);
+ else
+ rc = domain_context_unmap_one(domain, drhd->iommu, 0,
+ PCI_DEVFN(dev, 7));
+@@ -365,7 +366,8 @@ static int __must_check map_me_phantom_f
+ return rc;
+ }
+
+-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
++int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
++ unsigned int mode)
+ {
+ u32 id;
+ int rc = 0;
+@@ -389,7 +391,7 @@ int me_wifi_quirk(struct domain *domain,
+ case 0x423b8086:
+ case 0x423c8086:
+ case 0x423d8086:
+- rc = map_me_phantom_function(domain, 3, map);
++ rc = map_me_phantom_function(domain, 3, mode);
+ break;
+ default:
+ break;
+@@ -415,7 +417,7 @@ int me_wifi_quirk(struct domain *domain,
+ case 0x42388086: /* Puma Peak */
+ case 0x422b8086:
+ case 0x422c8086:
+- rc = map_me_phantom_function(domain, 22, map);
++ rc = map_me_phantom_function(domain, 22, mode);
+ break;
+ default:
+ break;
+--- xen/drivers/passthrough/vtd/vtd.h.orig
++++ xen/drivers/passthrough/vtd/vtd.h
+@@ -22,8 +22,14 @@
+
+ #include <xen/iommu.h>
+
+-#define MAP_ME_PHANTOM_FUNC 1
+-#define UNMAP_ME_PHANTOM_FUNC 0
++/*
++ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
++ * parameters.
++ */
++#define MAP_WITH_RMRR (1u << 0)
++#define MAP_OWNER_DYING (1u << 1)
++#define MAP_SINGLE_DEVICE (1u << 2)
++#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
+
+ /* Allow for both IOAPIC and IOSAPIC. */
+ #define IO_xAPIC_route_entry IO_APIC_route_entry
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: AMD/IOMMU: re-assign devices directly
+
+Devices with unity map ranges, due to it being unspecified how/when
+these memory ranges may get accessed, may not be left disconnected from
+their unity mappings (as long as it's not certain that the device has
+been fully quiesced). Hence rather than tearing down the old root page
+table pointer and then establishing the new one, re-assignment needs to
+be done in a single step.
+
+This is CVE-2022-26360 / part of XSA-400.
+
+Reported-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
+Similarly quarantining scratch-page mode relies on page tables to be
+continuously wired up.
+
+To avoid complicating things more than necessary, treat all devices
+mostly equally, i.e. regardless of their association with any unity map
+ranges. The main difference is when it comes to updating DTEs, which need
+to be atomic when there are unity mappings. Yet atomicity can only be
+achieved with CMPXCHG16B, availability of which we can't take for given.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
+--- xen/drivers/passthrough/amd/iommu.h.orig
++++ xen/drivers/passthrough/amd/iommu.h
+@@ -247,9 +247,13 @@ void amd_iommu_set_intremap_table(struct
+ const void *ptr,
+ const struct amd_iommu *iommu,
+ bool valid);
+-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+- uint64_t root_ptr, uint16_t domain_id,
+- uint8_t paging_mode, bool valid);
++#define SET_ROOT_VALID (1u << 0)
++#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
++int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
++ uint64_t root_ptr,
++ uint16_t domain_id,
++ uint8_t paging_mode,
++ unsigned int flags);
+ void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
+ const struct ivrs_mappings *ivrs_dev);
+
+--- xen/drivers/passthrough/amd/iommu_map.c.orig
++++ xen/drivers/passthrough/amd/iommu_map.c
+@@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese
+ return flush_flags;
+ }
+
+-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+- uint64_t root_ptr, uint16_t domain_id,
+- uint8_t paging_mode, bool valid)
++/*
++ * This function returns
++ * - -errno for errors,
++ * - 0 for a successful update, atomic when necessary
++ * - 1 for a successful but non-atomic update, which may need to be warned
++ * about by the caller.
++ */
++int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
++ uint64_t root_ptr, uint16_t domain_id,
++ uint8_t paging_mode, unsigned int flags)
+ {
++ bool valid = flags & SET_ROOT_VALID;
++
++ if ( dte->v && dte->tv &&
++ (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
++ {
++ union {
++ struct amd_iommu_dte dte;
++ uint64_t raw64[4];
++ __uint128_t raw128[2];
++ } ldte = { .dte = *dte };
++ __uint128_t old = ldte.raw128[0];
++ int ret = 0;
++
++ ldte.dte.domain_id = domain_id;
++ ldte.dte.pt_root = paddr_to_pfn(root_ptr);
++ ldte.dte.iw = true;
++ ldte.dte.ir = true;
++ ldte.dte.paging_mode = paging_mode;
++ ldte.dte.v = valid;
++
++ if ( cpu_has_cx16 )
++ {
++ __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
++
++ /*
++ * Hardware does not update the DTE behind our backs, so the
++ * return value should match "old".
++ */
++ if ( res != old )
++ {
++ printk(XENLOG_ERR
++ "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
++ domain_id,
++ (uint64_t)(res >> 64), (uint64_t)res,
++ (uint64_t)(old >> 64), (uint64_t)old);
++ ret = -EILSEQ;
++ }
++ }
++ else /* Best effort, updating domain_id last. */
++ {
++ uint64_t *ptr = (void *)dte;
++
++ write_atomic(ptr + 0, ldte.raw64[0]);
++ /* No barrier should be needed between these two. */
++ write_atomic(ptr + 1, ldte.raw64[1]);
++
++ ret = 1;
++ }
++
++ return ret;
++ }
++
+ if ( valid || dte->v )
+ {
+ dte->tv = false;
+@@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc
+ smp_wmb();
+ dte->tv = true;
+ dte->v = valid;
++
++ return 0;
+ }
+
+ void amd_iommu_set_intremap_table(
+--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
++++ xen/drivers/passthrough/amd/pci_amd_iommu.c
+@@ -81,41 +81,82 @@ int get_dma_requestor_id(uint16_t seg, u
+ return req_id;
+ }
+
+-static void amd_iommu_setup_domain_device(
++static int __must_check allocate_domain_resources(struct domain *d)
++{
++ struct domain_iommu *hd = dom_iommu(d);
++ int rc;
++
++ spin_lock(&hd->arch.mapping_lock);
++ rc = amd_iommu_alloc_root(d);
++ spin_unlock(&hd->arch.mapping_lock);
++
++ return rc;
++}
++
++static bool any_pdev_behind_iommu(const struct domain *d,
++ const struct pci_dev *exclude,
++ const struct amd_iommu *iommu)
++{
++ const struct pci_dev *pdev;
++
++ for_each_pdev ( d, pdev )
++ {
++ if ( pdev == exclude )
++ continue;
++
++ if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
++ return true;
++ }
++
++ return false;
++}
++
++static int __must_check amd_iommu_setup_domain_device(
+ struct domain *domain, struct amd_iommu *iommu,
+ uint8_t devfn, struct pci_dev *pdev)
+ {
+ struct amd_iommu_dte *table, *dte;
+ unsigned long flags;
+- int req_id, valid = 1;
++ unsigned int req_id, sr_flags;
++ int rc;
+ u8 bus = pdev->bus;
+ const struct domain_iommu *hd = dom_iommu(domain);
++ const struct ivrs_mappings *ivrs_dev;
++
++ BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
+
+- BUG_ON( !hd->arch.amd.root_table ||
+- !hd->arch.amd.paging_mode ||
+- !iommu->dev_table.buffer );
++ rc = allocate_domain_resources(domain);
++ if ( rc )
++ return rc;
+
+- if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
+- valid = 0;
++ req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
++ ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
++ sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
++ ? 0 : SET_ROOT_VALID)
++ | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
+
+ /* get device-table entry */
+ req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
+ table = iommu->dev_table.buffer;
+ dte = &table[req_id];
++ ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+
+ spin_lock_irqsave(&iommu->lock, flags);
+
+ if ( !dte->v || !dte->tv )
+ {
+- const struct ivrs_mappings *ivrs_dev;
+-
+ /* bind DTE to domain page-tables */
+- amd_iommu_set_root_page_table(
+- dte, page_to_maddr(hd->arch.amd.root_table),
+- domain->domain_id, hd->arch.amd.paging_mode, valid);
++ rc = amd_iommu_set_root_page_table(
++ dte, page_to_maddr(hd->arch.amd.root_table),
++ domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
++ if ( rc )
++ {
++ ASSERT(rc < 0);
++ spin_unlock_irqrestore(&iommu->lock, flags);
++ return rc;
++ }
+
+ /* Undo what amd_iommu_disable_domain_device() may have done. */
+- ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+ if ( dte->it_root )
+ {
+ dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
+@@ -130,17 +171,73 @@ static void amd_iommu_setup_domain_devic
+ dte->i = ats_enabled;
+
+ amd_iommu_flush_device(iommu, req_id);
++ }
++ else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
++ {
++ /*
++ * Strictly speaking if the device is the only one with this requestor
++ * ID, it could be allowed to be re-assigned regardless of unity map
++ * presence. But let's deal with that case only if it is actually
++ * found in the wild.
++ */
++ if ( req_id != PCI_BDF2(bus, devfn) &&
++ (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
++ rc = -EOPNOTSUPP;
++ else
++ rc = amd_iommu_set_root_page_table(
++ dte, page_to_maddr(hd->arch.amd.root_table),
++ domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
++ if ( rc < 0 )
++ {
++ spin_unlock_irqrestore(&iommu->lock, flags);
++ return rc;
++ }
++ if ( rc &&
++ domain != pdev->domain &&
++ /*
++ * By non-atomically updating the DTE's domain ID field last,
++ * during a short window in time TLB entries with the old domain
++ * ID but the new page tables may have been inserted. This could
++ * affect I/O of other devices using this same (old) domain ID.
++ * Such updating therefore is not a problem if this was the only
++ * device associated with the old domain ID. Diverting I/O of any
++ * of a dying domain's devices to the quarantine page tables is
++ * intended anyway.
++ */
++ !pdev->domain->is_dying &&
++ (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
++ pdev->phantom_stride) )
++ printk(" %pp: reassignment may cause %pd data corruption\n",
++ &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain);
++
++ /*
++ * Check remaining settings are still in place from an earlier call
++ * here. They're all independent of the domain, so should not have
++ * changed.
++ */
++ if ( dte->it_root )
++ ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
++ ASSERT(dte->iv == iommu_intremap);
++ ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
++ ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
++ ACPI_IVHD_SYSTEM_MGMT));
+
+- AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+- "root table = %#"PRIx64", "
+- "domain = %d, paging mode = %d\n",
+- req_id, pdev->type,
+- page_to_maddr(hd->arch.amd.root_table),
+- domain->domain_id, hd->arch.amd.paging_mode);
++ if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
++ iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
++ ASSERT(dte->i == ats_enabled);
++
++ amd_iommu_flush_device(iommu, req_id);
+ }
+
+ spin_unlock_irqrestore(&iommu->lock, flags);
+
++ AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
++ "root table = %#"PRIx64", "
++ "domain = %d, paging mode = %d\n",
++ req_id, pdev->type,
++ page_to_maddr(hd->arch.amd.root_table),
++ domain->domain_id, hd->arch.amd.paging_mode);
++
+ ASSERT(pcidevs_locked());
+
+ if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+@@ -151,6 +248,8 @@ static void amd_iommu_setup_domain_devic
+
+ amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
+ }
++
++ return 0;
+ }
+
+ int __init acpi_ivrs_init(void)
+@@ -216,18 +315,6 @@ int amd_iommu_alloc_root(struct domain *
+ return 0;
+ }
+
+-static int __must_check allocate_domain_resources(struct domain *d)
+-{
+- struct domain_iommu *hd = dom_iommu(d);
+- int rc;
+-
+- spin_lock(&hd->arch.mapping_lock);
+- rc = amd_iommu_alloc_root(d);
+- spin_unlock(&hd->arch.mapping_lock);
+-
+- return rc;
+-}
+-
+ int __read_mostly amd_iommu_min_paging_mode = 1;
+
+ static int amd_iommu_domain_init(struct domain *d)
+@@ -340,7 +427,15 @@ static int reassign_device(struct domain
+ return -ENODEV;
+ }
+
+- amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
++ rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
++ if ( rc )
++ return rc;
++
++ if ( devfn == pdev->devfn && pdev->domain != target )
++ {
++ list_move(&pdev->domain_list, &target->pdev_list);
++ pdev->domain = target;
++ }
+
+ /*
+ * If the device belongs to the hardware domain, and it has a unity mapping,
+@@ -356,26 +451,9 @@ static int reassign_device(struct domain
+ return rc;
+ }
+
+- if ( devfn == pdev->devfn && pdev->domain != dom_io )
+- {
+- list_move(&pdev->domain_list, &dom_io->pdev_list);
+- pdev->domain = dom_io;
+- }
+-
+- rc = allocate_domain_resources(target);
+- if ( rc )
+- return rc;
+-
+- amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+ AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
+ &pdev->sbdf, source->domain_id, target->domain_id);
+
+- if ( devfn == pdev->devfn && pdev->domain != target )
+- {
+- list_move(&pdev->domain_list, &target->pdev_list);
+- pdev->domain = target;
+- }
+-
+ return 0;
+ }
+
+@@ -490,8 +568,7 @@ static int amd_iommu_add_device(u8 devfn
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ }
+
+- amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+- return 0;
++ return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+ }
+
+ static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: prepare for per-device quarantine page tables (part I)
+
+Arrange for domain ID and page table root to be passed around, the latter in
+particular to domain_pgd_maddr() such that taking it from the per-domain
+fields can be overridden.
+
+No functional change intended.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/drivers/passthrough/vtd/extern.h.orig
++++ xen/drivers/passthrough/vtd/extern.h
+@@ -85,9 +85,10 @@ void *map_vtd_domain_page(u64 maddr);
+ void unmap_vtd_domain_page(const void *va);
+ int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
+ uint8_t bus, uint8_t devfn,
+- const struct pci_dev *pdev, unsigned int mode);
++ const struct pci_dev *pdev, domid_t domid,
++ paddr_t pgd_maddr, unsigned int mode);
+ int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
+- u8 bus, u8 devfn);
++ uint8_t bus, uint8_t devfn, domid_t domid);
+ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
+
+ unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
+@@ -105,7 +106,8 @@ void platform_quirks_init(void);
+ void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
+ void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
+ int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+- uint8_t devfn, unsigned int mode);
++ uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
++ unsigned int mode);
+ void pci_vtd_quirk(const struct pci_dev *);
+ void quirk_iommu_caps(struct vtd_iommu *iommu);
+
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -355,15 +355,17 @@ static u64 addr_to_dma_page_maddr(struct
+ return pte_maddr;
+ }
+
+-static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels)
++static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
++ unsigned int nr_pt_levels)
+ {
+ struct domain_iommu *hd = dom_iommu(d);
+- uint64_t pgd_maddr;
+ unsigned int agaw;
+
+ ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+
+- if ( iommu_use_hap_pt(d) )
++ if ( pgd_maddr )
++ /* nothing */;
++ else if ( iommu_use_hap_pt(d) )
+ {
+ pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
+
+@@ -1376,12 +1378,12 @@ int domain_context_mapping_one(
+ struct domain *domain,
+ struct vtd_iommu *iommu,
+ uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+- unsigned int mode)
++ domid_t domid, paddr_t pgd_maddr, unsigned int mode)
+ {
+ struct domain_iommu *hd = dom_iommu(domain);
+ struct context_entry *context, *context_entries, lctxt;
+ __uint128_t old;
+- u64 maddr, pgd_maddr;
++ uint64_t maddr;
+ uint16_t seg = iommu->drhd->segment, prev_did = 0;
+ struct domain *prev_dom = NULL;
+ int rc, ret;
+@@ -1421,10 +1423,12 @@ int domain_context_mapping_one(
+ }
+ else
+ {
++ paddr_t root;
++
+ spin_lock(&hd->arch.mapping_lock);
+
+- pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels);
+- if ( !pgd_maddr )
++ root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
++ if ( !root )
+ {
+ spin_unlock(&hd->arch.mapping_lock);
+ spin_unlock(&iommu->lock);
+@@ -1434,7 +1438,7 @@ int domain_context_mapping_one(
+ return -ENOMEM;
+ }
+
+- context_set_address_root(lctxt, pgd_maddr);
++ context_set_address_root(lctxt, root);
+ if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
+ context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
+ else
+@@ -1550,15 +1554,21 @@ int domain_context_mapping_one(
+ unmap_vtd_domain_page(context_entries);
+
+ if ( !seg && !rc )
+- rc = me_wifi_quirk(domain, bus, devfn, mode);
++ rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
+
+ if ( rc )
+ {
+ if ( !prev_dom )
+- ret = domain_context_unmap_one(domain, iommu, bus, devfn);
++ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
++ domain->domain_id);
+ else if ( prev_dom != domain ) /* Avoid infinite recursion. */
++ {
++ hd = dom_iommu(prev_dom);
+ ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
++ domain->domain_id,
++ hd->arch.vtd.pgd_maddr,
+ mode & MAP_WITH_RMRR) < 0;
++ }
+ else
+ ret = 1;
+
+@@ -1580,6 +1590,7 @@ static int domain_context_mapping(struct
+ {
+ struct acpi_drhd_unit *drhd;
+ const struct acpi_rmrr_unit *rmrr;
++ paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+ int ret = 0;
+ unsigned int i, mode = 0;
+ uint16_t seg = pdev->seg, bdf;
+@@ -1643,7 +1654,8 @@ static int domain_context_mapping(struct
+ printk(VTDPREFIX "%pd:PCIe: map %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev, mode);
++ pdev, domain->domain_id, pgd_maddr,
++ mode);
+ if ( ret > 0 )
+ ret = 0;
+ if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+@@ -1657,7 +1669,8 @@ static int domain_context_mapping(struct
+ domain, &PCI_SBDF3(seg, bus, devfn));
+
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev, mode);
++ pdev, domain->domain_id, pgd_maddr,
++ mode);
+ if ( ret < 0 )
+ break;
+ prev_present = ret;
+@@ -1685,7 +1698,8 @@ static int domain_context_mapping(struct
+ */
+ if ( ret >= 0 )
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- NULL, mode);
++ NULL, domain->domain_id, pgd_maddr,
++ mode);
+
+ /*
+ * Devices behind PCIe-to-PCI/PCIx bridge may generate different
+@@ -1700,7 +1714,8 @@ static int domain_context_mapping(struct
+ if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
+ (secbus != pdev->bus || pdev->devfn != 0) )
+ ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
+- NULL, mode);
++ NULL, domain->domain_id, pgd_maddr,
++ mode);
+
+ if ( ret )
+ {
+@@ -1728,7 +1743,7 @@ static int domain_context_mapping(struct
+ int domain_context_unmap_one(
+ struct domain *domain,
+ struct vtd_iommu *iommu,
+- u8 bus, u8 devfn)
++ uint8_t bus, uint8_t devfn, domid_t domid)
+ {
+ struct context_entry *context, *context_entries;
+ u64 maddr;
+@@ -1786,7 +1801,7 @@ int domain_context_unmap_one(
+ unmap_vtd_domain_page(context_entries);
+
+ if ( !iommu->drhd->segment && !rc )
+- rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
++ rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
+
+ if ( rc && !is_hardware_domain(domain) && domain != dom_io )
+ {
+@@ -1837,7 +1852,8 @@ static int domain_context_unmap(struct d
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+- ret = domain_context_unmap_one(domain, iommu, bus, devfn);
++ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
++ domain->domain_id);
+ if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+ disable_ats_device(pdev);
+
+@@ -1847,7 +1863,8 @@ static int domain_context_unmap(struct d
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+- ret = domain_context_unmap_one(domain, iommu, bus, devfn);
++ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
++ domain->domain_id);
+ if ( ret )
+ break;
+
+@@ -1873,12 +1890,15 @@ static int domain_context_unmap(struct d
+ /* PCIe to PCI/PCIx bridge */
+ if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+ {
+- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
++ domain->domain_id);
+ if ( !ret )
+- ret = domain_context_unmap_one(domain, iommu, secbus, 0);
++ ret = domain_context_unmap_one(domain, iommu, secbus, 0,
++ domain->domain_id);
+ }
+ else /* Legacy PCI bridge */
+- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
++ domain->domain_id);
+
+ break;
+
+--- xen/drivers/passthrough/vtd/quirks.c.orig
++++ xen/drivers/passthrough/vtd/quirks.c
+@@ -345,6 +345,8 @@ void __init platform_quirks_init(void)
+
+ static int __must_check map_me_phantom_function(struct domain *domain,
+ unsigned int dev,
++ domid_t domid,
++ paddr_t pgd_maddr,
+ unsigned int mode)
+ {
+ struct acpi_drhd_unit *drhd;
+@@ -358,16 +360,17 @@ static int __must_check map_me_phantom_f
+ /* map or unmap ME phantom function */
+ if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
+ rc = domain_context_mapping_one(domain, drhd->iommu, 0,
+- PCI_DEVFN(dev, 7), NULL, mode);
++ PCI_DEVFN(dev, 7), NULL,
++ domid, pgd_maddr, mode);
+ else
+ rc = domain_context_unmap_one(domain, drhd->iommu, 0,
+- PCI_DEVFN(dev, 7));
++ PCI_DEVFN(dev, 7), domid);
+
+ return rc;
+ }
+
+ int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+- unsigned int mode)
++ domid_t domid, paddr_t pgd_maddr, unsigned int mode)
+ {
+ u32 id;
+ int rc = 0;
+@@ -391,7 +394,7 @@ int me_wifi_quirk(struct domain *domain,
+ case 0x423b8086:
+ case 0x423c8086:
+ case 0x423d8086:
+- rc = map_me_phantom_function(domain, 3, mode);
++ rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
+ break;
+ default:
+ break;
+@@ -417,7 +420,7 @@ int me_wifi_quirk(struct domain *domain,
+ case 0x42388086: /* Puma Peak */
+ case 0x422b8086:
+ case 0x422c8086:
+- rc = map_me_phantom_function(domain, 22, mode);
++ rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
+ break;
+ default:
+ break;
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: VT-d: prepare for per-device quarantine page tables (part II)
+
+Replace the passing of struct domain * by domid_t in preparation for
+per-device quarantine page tables also requiring per-device pseudo
+domain IDs, which aren't going to be associated with any struct domain
+instances.
+
+No functional change intended (except for slightly adjusted log message
+text).
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
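A minimal stand-alone sketch of the interface change just described, with invented names and a toy lookup table: the helper takes a bare domid_t plus an explicit 'warn' flag instead of a struct domain *, so callers that only hold an ID (such as a pseudo ID with no struct domain behind it) can use the same code path.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t domid_t;

    /* Toy per-IOMMU table mapping hardware DID slots to owning domain IDs. */
    #define NR_DIDS 16
    static domid_t domid_map[NR_DIDS] = { 1, 7, 5 };

    /*
     * Previously such a lookup took a struct domain * and consulted
     * d->domain_id and d->is_dying; here it takes only the ID and an
     * explicit 'warn' flag, decoupling it from struct domain entirely.
     */
    static int get_did(domid_t domid, bool warn)
    {
        for (unsigned int i = 0; i < NR_DIDS; ++i)
            if (domid_map[i] == domid)
                return (int)i;

        if (warn)
            fprintf(stderr, "no DID for domain %u\n", (unsigned int)domid);
        return -1;
    }

    int main(void)
    {
        printf("%d\n", get_did(7, true));    /* 1 */
        printf("%d\n", get_did(42, false));  /* -1, silently */
        return 0;
    }
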
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -59,8 +59,8 @@ static struct tasklet vtd_fault_tasklet;
+ static int setup_hwdom_device(u8 devfn, struct pci_dev *);
+ static void setup_hwdom_rmrr(struct domain *d);
+
+-static int domain_iommu_domid(struct domain *d,
+- struct vtd_iommu *iommu)
++static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
++ bool warn)
+ {
+ unsigned long nr_dom, i;
+
+@@ -68,16 +68,16 @@ static int domain_iommu_domid(struct dom
+ i = find_first_bit(iommu->domid_bitmap, nr_dom);
+ while ( i < nr_dom )
+ {
+- if ( iommu->domid_map[i] == d->domain_id )
++ if ( iommu->domid_map[i] == domid )
+ return i;
+
+ i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
+ }
+
+- if ( !d->is_dying )
++ if ( warn )
+ dprintk(XENLOG_ERR VTDPREFIX,
+- "Cannot get valid iommu %u domid: %pd\n",
+- iommu->index, d);
++ "No valid iommu %u domid for Dom%d\n",
++ iommu->index, domid);
+
+ return -1;
+ }
+@@ -85,8 +85,7 @@ static int domain_iommu_domid(struct dom
+ #define DID_FIELD_WIDTH 16
+ #define DID_HIGH_OFFSET 8
+ static int context_set_domain_id(struct context_entry *context,
+- struct domain *d,
+- struct vtd_iommu *iommu)
++ domid_t domid, struct vtd_iommu *iommu)
+ {
+ unsigned long nr_dom, i;
+ int found = 0;
+@@ -97,7 +96,7 @@ static int context_set_domain_id(struct
+ i = find_first_bit(iommu->domid_bitmap, nr_dom);
+ while ( i < nr_dom )
+ {
+- if ( iommu->domid_map[i] == d->domain_id )
++ if ( iommu->domid_map[i] == domid )
+ {
+ found = 1;
+ break;
+@@ -113,7 +112,7 @@ static int context_set_domain_id(struct
+ dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
+ return -EFAULT;
+ }
+- iommu->domid_map[i] = d->domain_id;
++ iommu->domid_map[i] = domid;
+ }
+
+ set_bit(i, iommu->domid_bitmap);
+@@ -122,9 +121,9 @@ static int context_set_domain_id(struct
+ return 0;
+ }
+
+-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
++static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
+ {
+- int iommu_domid = domain_iommu_domid(domain, iommu);
++ int iommu_domid = get_iommu_did(domid, iommu, false);
+
+ if ( iommu_domid >= 0 )
+ {
+@@ -180,7 +179,7 @@ static void check_cleanup_domid_map(stru
+ if ( !found )
+ {
+ clear_bit(iommu->index, &dom_iommu(d)->arch.vtd.iommu_bitmap);
+- cleanup_domid_map(d, iommu);
++ cleanup_domid_map(d->domain_id, iommu);
+ }
+ }
+
+@@ -687,7 +686,7 @@ static int __must_check iommu_flush_iotl
+ continue;
+
+ flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
+- iommu_domid= domain_iommu_domid(d, iommu);
++ iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
+ if ( iommu_domid == -1 )
+ continue;
+
+@@ -1447,7 +1446,7 @@ int domain_context_mapping_one(
+ spin_unlock(&hd->arch.mapping_lock);
+ }
+
+- if ( context_set_domain_id(&lctxt, domain, iommu) )
++ if ( context_set_domain_id(&lctxt, domid, iommu) )
+ {
+ unlock:
+ spin_unlock(&iommu->lock);
+@@ -1768,7 +1767,7 @@ int domain_context_unmap_one(
+ context_clear_entry(*context);
+ iommu_sync_cache(context, sizeof(struct context_entry));
+
+- iommu_domid= domain_iommu_domid(domain, iommu);
++ iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
+ if ( iommu_domid == -1 )
+ {
+ spin_unlock(&iommu->lock);
+@@ -1938,7 +1937,7 @@ static void iommu_domain_teardown(struct
+ ASSERT(!hd->arch.vtd.pgd_maddr);
+
+ for_each_drhd_unit ( drhd )
+- cleanup_domid_map(d, drhd->iommu);
++ cleanup_domid_map(d->domain_id, drhd->iommu);
+ }
+
+ static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: IOMMU/x86: maintain a per-device pseudo domain ID
+
+In order to subsequently enable per-device quarantine page tables, we'll
+need domain-ID-like identifiers to be inserted in the respective device
+(AMD) or context (Intel) table entries alongside the per-device page
+table root addresses.
+
+Make use of "real" domain IDs occupying only half of the value range
+coverable by domid_t.
+
+Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
+leaks in case of error, but existing ones don't get plugged - that'll be
+the subject of a later change.
+
+The VT-d changes are slightly asymmetric, but this way we can avoid
+assigning pseudo domain IDs to devices which would never be mapped while
+still avoiding the addition of a new parameter to domain_context_unmap().
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
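The allocator added further down (iommu_init_domid(), iommu_alloc_domid(), iommu_free_domid()) hands out IDs strictly above DOMID_MASK from a bitmap. The following is a compressed stand-alone sketch of that idea only: the constants (ID_MASK, NR_PSEUDO), the missing locking, and the 0 sentinel are simplifying assumptions, not the Xen implementation.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t domid_t;

    #define ID_MASK    0x7fffu            /* "real" domain IDs live at or below this */
    #define NR_PSEUDO  (ID_MASK + 1u)     /* size of the upper half of the ID space  */
    #define ID_INVALID ((domid_t)0)       /* 0 can never be a pseudo ID in this sketch */

    static uint64_t bitmap[(NR_PSEUDO + 63) / 64];

    /* Hand out an unused ID from the upper half; remember where the last
     * search stopped so freed IDs are not re-used immediately. */
    static domid_t alloc_pseudo_domid(void)
    {
        static unsigned int start;

        for (unsigned int n = 0; n < NR_PSEUDO; ++n) {
            unsigned int i = (start + n) % NR_PSEUDO;

            if (!(bitmap[i / 64] & (UINT64_C(1) << (i % 64)))) {
                bitmap[i / 64] |= UINT64_C(1) << (i % 64);
                start = i + 1;
                return (domid_t)(i | (ID_MASK + 1u));   /* always > ID_MASK */
            }
        }
        return ID_INVALID;                              /* upper half exhausted */
    }

    static void free_pseudo_domid(domid_t id)
    {
        unsigned int i = id & ID_MASK;

        assert(id > ID_MASK);
        bitmap[i / 64] &= ~(UINT64_C(1) << (i % 64));
    }

    int main(void)
    {
        domid_t a = alloc_pseudo_domid(), b = alloc_pseudo_domid();

        printf("%#x %#x\n", (unsigned int)a, (unsigned int)b); /* e.g. 0x8000 0x8001 */
        free_pseudo_domid(a);
        return 0;
    }

Drawing pseudo IDs from the upper half keeps them disjoint from real domain IDs, so a value on its own already tells whether it names a domain or a quarantined device.
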
+--- xen/include/asm-x86/iommu.h.orig
++++ xen/include/asm-x86/iommu.h
+@@ -142,6 +142,10 @@ int pi_update_irte(const struct pi_desc
+ iommu_vcall(ops, sync_cache, addr, size); \
+ })
+
++unsigned long *iommu_init_domid(void);
++domid_t iommu_alloc_domid(unsigned long *map);
++void iommu_free_domid(domid_t domid, unsigned long *map);
++
+ int __must_check iommu_free_pgtables(struct domain *d);
+ struct domain_iommu;
+ struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
+--- xen/include/asm-x86/pci.h.orig
++++ xen/include/asm-x86/pci.h
+@@ -15,6 +15,12 @@
+
+ struct arch_pci_dev {
+ vmask_t used_vectors;
++ /*
++ * These fields are (de)initialized under pcidevs-lock. Other uses of
++ * them don't race (de)initialization and hence don't strictly need any
++ * locking.
++ */
++ domid_t pseudo_domid;
+ };
+
+ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
+--- xen/drivers/passthrough/amd/iommu.h.orig
++++ xen/drivers/passthrough/amd/iommu.h
+@@ -96,6 +96,7 @@ struct amd_iommu {
+ struct ring_buffer cmd_buffer;
+ struct ring_buffer event_log;
+ struct ring_buffer ppr_log;
++ unsigned long *domid_map;
+
+ int exclusion_enable;
+ int exclusion_allow_all;
+--- xen/drivers/passthrough/amd/iommu_detect.c.orig
++++ xen/drivers/passthrough/amd/iommu_detect.c
+@@ -180,6 +180,11 @@ int __init amd_iommu_detect_one_acpi(
+ if ( rt )
+ goto out;
+
++ iommu->domid_map = iommu_init_domid();
++ rt = -ENOMEM;
++ if ( !iommu->domid_map )
++ goto out;
++
+ rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
+ if ( rt )
+ printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n",
+@@ -190,7 +195,10 @@ int __init amd_iommu_detect_one_acpi(
+
+ out:
+ if ( rt )
++ {
++ xfree(iommu->domid_map);
+ xfree(iommu);
++ }
+
+ return rt;
+ }
+--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
++++ xen/drivers/passthrough/amd/pci_amd_iommu.c
+@@ -508,6 +508,8 @@ static int amd_iommu_add_device(u8 devfn
+ struct amd_iommu *iommu;
+ u16 bdf;
+ struct ivrs_mappings *ivrs_mappings;
++ bool fresh_domid = false;
++ int ret;
+
+ if ( !pdev->domain )
+ return -EINVAL;
+@@ -568,7 +570,22 @@ static int amd_iommu_add_device(u8 devfn
+ spin_unlock_irqrestore(&iommu->lock, flags);
+ }
+
+- return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
++ if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
++ {
++ pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
++ if ( pdev->arch.pseudo_domid == DOMID_INVALID )
++ return -ENOSPC;
++ fresh_domid = true;
++ }
++
++ ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
++ if ( ret && fresh_domid )
++ {
++ iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
++ pdev->arch.pseudo_domid = DOMID_INVALID;
++ }
++
++ return ret;
+ }
+
+ static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
+@@ -591,6 +608,9 @@ static int amd_iommu_remove_device(u8 de
+
+ amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
+
++ iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
++ pdev->arch.pseudo_domid = DOMID_INVALID;
++
+ ivrs_mappings = get_ivrs_mappings(pdev->seg);
+ bdf = PCI_BDF2(pdev->bus, devfn);
+ if ( amd_iommu_perdev_intremap &&
+--- xen/drivers/passthrough/pci.c.orig
++++ xen/drivers/passthrough/pci.c
+@@ -327,6 +327,7 @@ static struct pci_dev *alloc_pdev(struct
+ *((u8*) &pdev->bus) = bus;
+ *((u8*) &pdev->devfn) = devfn;
+ pdev->domain = NULL;
++ pdev->arch.pseudo_domid = DOMID_INVALID;
+ INIT_LIST_HEAD(&pdev->msi_list);
+
+ pos = pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+@@ -1276,8 +1277,12 @@ static int _dump_pci_devices(struct pci_
+
+ list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
+ {
+- printk("%pp - %pd - node %-3d - MSIs < ",
+- &pdev->sbdf, pdev->domain,
++ printk("%pp - ", &pdev->sbdf);
++ if ( pdev->domain == dom_io )
++ printk("DomIO:%x", pdev->arch.pseudo_domid);
++ else
++ printk("%pd", pdev->domain);
++ printk(" - node %-3d - MSIs < ",
+ (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
+ list_for_each_entry ( msi, &pdev->msi_list, list )
+ printk("%d ", msi->irq);
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -22,6 +22,7 @@
+ #include <xen/sched.h>
+ #include <xen/xmalloc.h>
+ #include <xen/domain_page.h>
++#include <xen/err.h>
+ #include <xen/iocap.h>
+ #include <xen/iommu.h>
+ #include <xen/numa.h>
+@@ -1204,7 +1205,7 @@ int __init iommu_alloc(struct acpi_drhd_
+ {
+ struct vtd_iommu *iommu;
+ unsigned long sagaw, nr_dom;
+- int agaw;
++ int agaw, rc;
+
+ if ( nr_iommus >= MAX_IOMMUS )
+ {
+@@ -1297,7 +1298,16 @@ int __init iommu_alloc(struct acpi_drhd_
+ if ( !iommu->domid_map )
+ return -ENOMEM;
+
++ iommu->pseudo_domid_map = iommu_init_domid();
++ rc = -ENOMEM;
++ if ( !iommu->pseudo_domid_map )
++ goto free;
++
+ return 0;
++
++ free:
++ iommu_free(drhd);
++ return rc;
+ }
+
+ void __init iommu_free(struct acpi_drhd_unit *drhd)
+@@ -1320,6 +1330,7 @@ void __init iommu_free(struct acpi_drhd_
+
+ xfree(iommu->domid_bitmap);
+ xfree(iommu->domid_map);
++ xfree(iommu->pseudo_domid_map);
+
+ if ( iommu->msi.irq >= 0 )
+ destroy_irq(iommu->msi.irq);
+@@ -1581,8 +1592,8 @@ int domain_context_mapping_one(
+ return rc ?: pdev && prev_dom;
+ }
+
+-static int domain_context_unmap(struct domain *d, uint8_t devfn,
+- struct pci_dev *pdev);
++static const struct acpi_drhd_unit *domain_context_unmap(
++ struct domain *d, uint8_t devfn, struct pci_dev *pdev);
+
+ static int domain_context_mapping(struct domain *domain, u8 devfn,
+ struct pci_dev *pdev)
+@@ -1590,6 +1601,7 @@ static int domain_context_mapping(struct
+ struct acpi_drhd_unit *drhd;
+ const struct acpi_rmrr_unit *rmrr;
+ paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
++ domid_t orig_domid = pdev->arch.pseudo_domid;
+ int ret = 0;
+ unsigned int i, mode = 0;
+ uint16_t seg = pdev->seg, bdf;
+@@ -1649,6 +1661,14 @@ static int domain_context_mapping(struct
+ break;
+
+ case DEV_TYPE_PCIe_ENDPOINT:
++ if ( iommu_quarantine && orig_domid == DOMID_INVALID )
++ {
++ pdev->arch.pseudo_domid =
++ iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
++ if ( pdev->arch.pseudo_domid == DOMID_INVALID )
++ return -ENOSPC;
++ }
++
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:PCIe: map %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+@@ -1663,6 +1683,14 @@ static int domain_context_mapping(struct
+ break;
+
+ case DEV_TYPE_PCI:
++ if ( iommu_quarantine && orig_domid == DOMID_INVALID )
++ {
++ pdev->arch.pseudo_domid =
++ iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
++ if ( pdev->arch.pseudo_domid == DOMID_INVALID )
++ return -ENOSPC;
++ }
++
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:PCI: map %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+@@ -1736,6 +1764,13 @@ static int domain_context_mapping(struct
+ if ( !ret && devfn == pdev->devfn )
+ pci_vtd_quirk(pdev);
+
++ if ( ret && drhd && orig_domid == DOMID_INVALID )
++ {
++ iommu_free_domid(pdev->arch.pseudo_domid,
++ drhd->iommu->pseudo_domid_map);
++ pdev->arch.pseudo_domid = DOMID_INVALID;
++ }
++
+ return ret;
+ }
+
+@@ -1818,8 +1853,10 @@ int domain_context_unmap_one(
+ return rc;
+ }
+
+-static int domain_context_unmap(struct domain *domain, u8 devfn,
+- struct pci_dev *pdev)
++static const struct acpi_drhd_unit *domain_context_unmap(
++ struct domain *domain,
++ uint8_t devfn,
++ struct pci_dev *pdev)
+ {
+ struct acpi_drhd_unit *drhd;
+ struct vtd_iommu *iommu;
+@@ -1829,7 +1866,7 @@ static int domain_context_unmap(struct d
+
+ drhd = acpi_find_matched_drhd_unit(pdev);
+ if ( !drhd )
+- return -ENODEV;
++ return ERR_PTR(-ENODEV);
+ iommu = drhd->iommu;
+
+ switch ( pdev->type )
+@@ -1839,7 +1876,7 @@ static int domain_context_unmap(struct d
+ printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+ if ( !is_hardware_domain(domain) )
+- return -EPERM;
++ return ERR_PTR(-EPERM);
+ goto out;
+
+ case DEV_TYPE_PCIe_BRIDGE:
+@@ -1912,7 +1949,7 @@ static int domain_context_unmap(struct d
+ check_cleanup_domid_map(domain, pdev, iommu);
+
+ out:
+- return ret;
++ return ret ? ERR_PTR(ret) : drhd;
+ }
+
+ static void iommu_clear_root_pgtable(struct domain *d)
+@@ -2137,16 +2174,17 @@ static int intel_iommu_enable_device(str
+
+ static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
+ {
++ const struct acpi_drhd_unit *drhd;
+ struct acpi_rmrr_unit *rmrr;
+ u16 bdf;
+- int ret, i;
++ unsigned int i;
+
+ if ( !pdev->domain )
+ return -EINVAL;
+
+- ret = domain_context_unmap(pdev->domain, devfn, pdev);
+- if ( ret )
+- return ret;
++ drhd = domain_context_unmap(pdev->domain, devfn, pdev);
++ if ( IS_ERR(drhd) )
++ return PTR_ERR(drhd);
+
+ for_each_rmrr_device ( rmrr, bdf, i )
+ {
+@@ -2163,6 +2201,13 @@ static int intel_iommu_remove_device(u8
+ rmrr->end_address, 0);
+ }
+
++ if ( drhd )
++ {
++ iommu_free_domid(pdev->arch.pseudo_domid,
++ drhd->iommu->pseudo_domid_map);
++ pdev->arch.pseudo_domid = DOMID_INVALID;
++ }
++
+ return 0;
+ }
+
+--- xen/drivers/passthrough/vtd/iommu.h.orig
++++ xen/drivers/passthrough/vtd/iommu.h
+@@ -535,6 +535,7 @@ struct vtd_iommu {
+ } flush;
+
+ struct list_head ats_devices;
++ unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
+ unsigned long *domid_bitmap; /* domain id bitmap */
+ u16 *domid_map; /* domain id mapping array */
+ uint32_t version;
+--- xen/drivers/passthrough/x86/iommu.c.orig
++++ xen/drivers/passthrough/x86/iommu.c
+@@ -386,6 +386,53 @@ void __hwdom_init arch_iommu_hwdom_init(
+ return;
+ }
+
++unsigned long *__init iommu_init_domid(void)
++{
++ if ( !iommu_quarantine )
++ return ZERO_BLOCK_PTR;
++
++ BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
++
++ return xzalloc_array(unsigned long,
++ BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
++}
++
++domid_t iommu_alloc_domid(unsigned long *map)
++{
++ /*
++ * This is used uniformly across all IOMMUs, such that on typical
++ * systems we wouldn't re-use the same ID very quickly (perhaps never).
++ */
++ static unsigned int start;
++ unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
++
++ ASSERT(pcidevs_locked());
++
++ if ( idx >= UINT16_MAX - DOMID_MASK )
++ idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
++ if ( idx >= UINT16_MAX - DOMID_MASK )
++ return DOMID_INVALID;
++
++ __set_bit(idx, map);
++
++ start = idx + 1;
++
++ return idx | (DOMID_MASK + 1);
++}
++
++void iommu_free_domid(domid_t domid, unsigned long *map)
++{
++ ASSERT(pcidevs_locked());
++
++ if ( domid == DOMID_INVALID )
++ return;
++
++ ASSERT(domid > DOMID_MASK);
++
++ if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
++ BUG();
++}
++
+ int iommu_free_pgtables(struct domain *d)
+ {
+ struct domain_iommu *hd = dom_iommu(d);
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks
+
+The page tables just created aren't hooked up yet anywhere, so there's
+nothing that could be present in any TLB, and hence nothing to flush.
+Dropping this flush is, at least on the VT-d side, a prereq to per-
+device domain ID use when quarantining devices, as dom_io isn't going
+to be assigned a DID anymore: The warning in get_iommu_did() would
+trigger.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+
+--- xen/drivers/passthrough/amd/iommu_map.c.orig
++++ xen/drivers/passthrough/amd/iommu_map.c
+@@ -584,8 +584,6 @@ int __init amd_iommu_quarantine_init(str
+ out:
+ spin_unlock(&hd->arch.mapping_lock);
+
+- amd_iommu_flush_all_pages(d);
+-
+ /* Pages leaked in failure case */
+ return level ? -ENOMEM : 0;
+ }
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -2958,9 +2958,6 @@ static int __init intel_iommu_quarantine
+ out:
+ spin_unlock(&hd->arch.mapping_lock);
+
+- if ( !rc )
+- rc = iommu_flush_iotlb_all(d);
+-
+ /* Pages may be leaked in failure case */
+ return rc;
+ }
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: AMD/IOMMU: abstract maximum number of page table levels
+
+We will want to use the constant elsewhere.
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+
+--- xen/drivers/passthrough/amd/iommu.h.orig
++++ xen/drivers/passthrough/amd/iommu.h
+@@ -358,7 +358,7 @@ static inline int amd_iommu_get_paging_m
+ while ( max_frames > PTE_PER_TABLE_SIZE )
+ {
+ max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
+- if ( ++level > 6 )
++ if ( ++level > IOMMU_MAX_PT_LEVELS )
+ return -ENOMEM;
+ }
+
+--- xen/drivers/passthrough/amd/iommu-defs.h.orig
++++ xen/drivers/passthrough/amd/iommu-defs.h
+@@ -106,6 +106,7 @@ struct amd_iommu_dte {
+ bool tv:1;
+ unsigned int :5;
+ unsigned int had:2;
++#define IOMMU_MAX_PT_LEVELS 6
+ unsigned int paging_mode:3;
+ uint64_t pt_root:40;
+ bool ppr:1;
+--- xen/drivers/passthrough/amd/iommu_map.c.orig
++++ xen/drivers/passthrough/amd/iommu_map.c
+@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
+ table = hd->arch.amd.root_table;
+ level = hd->arch.amd.paging_mode;
+
+- BUG_ON( table == NULL || level < 1 || level > 6 );
++ BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
+
+ /*
+ * A frame number past what the current page tables can represent can't
+From: Jan Beulich <jbeulich%suse.com@localhost>
+Subject: IOMMU/x86: use per-device page tables for quarantining
+
+Devices with RMRRs / unity mapped regions, due to it being unspecified
+how/when these memory regions may be accessed, may not be left
+disconnected from the mappings of these regions (as long as it's not
+certain that the device has been fully quiesced). Hence even the page
+tables used when quarantining such devices need to have mappings of
+those regions. This implies installing page tables in the first place
+even when not in scratch-page quarantining mode.
+
+This is CVE-2022-26361 / part of XSA-400.
+
+While for the purpose here it would be sufficient to have devices with
+RMRRs / unity mapped regions use per-device page tables, extend this to
+all devices (in scratch-page quarantining mode). This allows the leaf
+pages to be mapped r/w, thus covering also memory writes (rather than
+just reads) issued by non-quiescent devices.
+
+Set up quarantine page tables as late as possible, yet early enough to
+not encounter failure during de-assign. This means setup generally
+happens in assign_device(), while (for now) the one in deassign_device()
+is there mainly to be on the safe side.
+
+In VT-d's DID allocation function don't require the IOMMU lock to be
+held anymore: All involved code paths hold pcidevs_lock, so this way we
+avoid the need to acquire the IOMMU lock around the new call to
+context_set_domain_id().
+
+Signed-off-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Paul Durrant <paul%xen.org@localhost>
+Reviewed-by: Kevin Tian <kevin.tian%intel.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
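A compressed stand-alone sketch of the page-table shape this patch builds per device (see fill_qpt() in the hunks below): entries that already exist, e.g. RMRR/unity mappings, are kept and recursed into, while every empty slot at every level is pointed at a shared per-level sub-tree that bottoms out in a single writable scratch page. Heap-allocated pointer tables stand in for IOMMU page tables here; the sizes and names are assumptions, not Xen code.

    #include <stdlib.h>

    #define SLOTS  512     /* entries per table              */
    #define LEVELS 4       /* 4-level hierarchy (an assumption) */

    struct pt {
        struct pt *slot[SLOTS];   /* at the lowest level a slot stands in for a data page */
    };

    /*
     * Point every empty slot at a shared sub-tree for its level (built once
     * and cached in pgs[]), and recurse into already-present entries so real
     * mappings are preserved.  All otherwise-unmapped addresses end up
     * resolving to the single shared leaf, pgs[0].
     */
    static int fill_empty(struct pt *this, unsigned int level,
                          struct pt *pgs[LEVELS])
    {
        for (unsigned int i = 0; i < SLOTS; ++i) {
            if (!this->slot[i]) {
                if (!pgs[level]) {
                    pgs[level] = calloc(1, sizeof(struct pt));
                    if (!pgs[level])
                        return -1;
                    if (level && fill_empty(pgs[level], level - 1, pgs))
                        return -1;
                }
                this->slot[i] = pgs[level];
            } else if (level && fill_empty(this->slot[i], level - 1, pgs)) {
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        struct pt *root = calloc(1, sizeof(*root));
        struct pt *pgs[LEVELS] = { 0 };

        /* Allocations are deliberately left to the OS on exit. */
        return root ? fill_empty(root, LEVELS - 1, pgs) : -1;
    }

Sharing one sub-tree per level keeps the per-device overhead at a handful of pages regardless of the size of the address space being covered.
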
+--- xen/arch/x86/mm/p2m.c.orig
++++ xen/arch/x86/mm/p2m.c
+@@ -1468,7 +1468,7 @@ int set_identity_p2m_entry(struct domain
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+ int ret;
+
+- if ( !paging_mode_translate(p2m->domain) )
++ if ( !paging_mode_translate(d) )
+ {
+ if ( !is_iommu_enabled(d) )
+ return 0;
+--- xen/include/asm-x86/pci.h.orig
++++ xen/include/asm-x86/pci.h
+@@ -1,6 +1,8 @@
+ #ifndef __X86_PCI_H__
+ #define __X86_PCI_H__
+
++#include <xen/mm.h>
++
+ #define CF8_BDF(cf8) ( ((cf8) & 0x00ffff00) >> 8)
+ #define CF8_ADDR_LO(cf8) ( (cf8) & 0x000000fc)
+ #define CF8_ADDR_HI(cf8) ( ((cf8) & 0x0f000000) >> 16)
+@@ -20,7 +22,18 @@ struct arch_pci_dev {
+ * them don't race (de)initialization and hence don't strictly need any
+ * locking.
+ */
++ union {
++ /* Subset of struct arch_iommu's fields, to be used in dom_io. */
++ struct {
++ uint64_t pgd_maddr;
++ } vtd;
++ struct {
++ struct page_info *root_table;
++ } amd;
++ };
+ domid_t pseudo_domid;
++ mfn_t leaf_mfn;
++ struct page_list_head pgtables_list;
+ };
+
+ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
+--- xen/drivers/passthrough/amd/iommu.h.orig
++++ xen/drivers/passthrough/amd/iommu.h
+@@ -223,7 +223,8 @@ int amd_iommu_init_late(void);
+ int amd_iommu_update_ivrs_mapping_acpi(void);
+ int iov_adjust_irq_affinities(void);
+
+-int amd_iommu_quarantine_init(struct domain *d);
++int amd_iommu_quarantine_init(struct pci_dev *pdev);
++void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
+
+ /* mapping functions */
+ int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
+--- xen/drivers/passthrough/amd/iommu_map.c.orig
++++ xen/drivers/passthrough/amd/iommu_map.c
+@@ -528,64 +528,135 @@ int amd_iommu_reserve_domain_unity_unmap
+ return rc;
+ }
+
+-int __init amd_iommu_quarantine_init(struct domain *d)
++static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
++ struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
+ {
+- struct domain_iommu *hd = dom_iommu(d);
++ struct domain_iommu *hd = dom_iommu(dom_io);
++ unsigned int i;
++ int rc = 0;
++
++ for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
++ {
++ union amd_iommu_pte *pte = &this[i], *next;
++
++ if ( !pte->pr )
++ {
++ if ( !pgs[level] )
++ {
++ /*
++ * The pgtable allocator is fine for the leaf page, as well as
++ * page table pages, and the resulting allocations are always
++ * zeroed.
++ */
++ pgs[level] = iommu_alloc_pgtable(hd);
++ if ( !pgs[level] )
++ {
++ rc = -ENOMEM;
++ break;
++ }
++
++ if ( level )
++ {
++ next = __map_domain_page(pgs[level]);
++ rc = fill_qpt(next, level - 1, pgs);
++ unmap_domain_page(next);
++ }
++ }
++
++ /*
++ * PDEs are essentially a subset of PTEs, so this function
++ * is fine to use even at the leaf.
++ */
++ set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
++ true, true);
++ }
++ else if ( level && pte->next_level )
++ {
++ next = map_domain_page(_mfn(pte->mfn));
++ rc = fill_qpt(next, level - 1, pgs);
++ unmap_domain_page(next);
++ }
++ }
++
++ return rc;
++}
++
++int amd_iommu_quarantine_init(struct pci_dev *pdev)
++{
++ struct domain_iommu *hd = dom_iommu(dom_io);
+ unsigned long end_gfn =
+ 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
+ unsigned int level = amd_iommu_get_paging_mode(end_gfn);
+- union amd_iommu_pte *table;
++ unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
++ const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
++ int rc;
+
+- if ( hd->arch.amd.root_table )
++ ASSERT(pcidevs_locked());
++ ASSERT(!hd->arch.amd.root_table);
++ ASSERT(page_list_empty(&hd->arch.pgtables.list));
++
++ ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
++
++ if ( pdev->arch.amd.root_table )
+ {
+- ASSERT_UNREACHABLE();
++ clear_domain_page(pdev->arch.leaf_mfn);
+ return 0;
+ }
+
+- spin_lock(&hd->arch.mapping_lock);
+-
+- hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
+- if ( !hd->arch.amd.root_table )
+- goto out;
+-
+- table = __map_domain_page(hd->arch.amd.root_table);
+- while ( level )
++ pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
++ if ( !pdev->arch.amd.root_table )
++ return -ENOMEM;
++
++ /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
++ hd->arch.amd.root_table = pdev->arch.amd.root_table;
++
++ rc = amd_iommu_reserve_domain_unity_map(dom_io,
++ ivrs_mappings[req_id].unity_map,
++ 0);
++
++ iommu_identity_map_teardown(dom_io);
++ hd->arch.amd.root_table = NULL;
++
++ if ( rc )
++ printk("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
++ else
+ {
+- struct page_info *pg;
+- unsigned int i;
++ union amd_iommu_pte *root;
++ struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
+
+- /*
+- * The pgtable allocator is fine for the leaf page, as well as
+- * page table pages, and the resulting allocations are always
+- * zeroed.
+- */
+- pg = iommu_alloc_pgtable(hd);
+- if ( !pg )
+- break;
++ spin_lock(&hd->arch.mapping_lock);
+
+- for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
+- {
+- union amd_iommu_pte *pde = &table[i];
++ root = __map_domain_page(pdev->arch.amd.root_table);
++ rc = fill_qpt(root, level - 1, pgs);
++ unmap_domain_page(root);
+
+- /*
+- * PDEs are essentially a subset of PTEs, so this function
+- * is fine to use even at the leaf.
+- */
+- set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
+- false, true);
+- }
++ pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
+
+- unmap_domain_page(table);
+- table = __map_domain_page(pg);
+- level--;
++ spin_unlock(&hd->arch.mapping_lock);
+ }
+- unmap_domain_page(table);
+
+- out:
+- spin_unlock(&hd->arch.mapping_lock);
++ page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
++
++ if ( rc )
++ amd_iommu_quarantine_teardown(pdev);
++
++ return rc;
++}
++
++void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
++{
++ struct domain_iommu *hd = dom_iommu(dom_io);
++
++ ASSERT(pcidevs_locked());
++
++ if ( !pdev->arch.amd.root_table )
++ return;
+
+- /* Pages leaked in failure case */
+- return level ? -ENOMEM : 0;
++ ASSERT(page_list_empty(&hd->arch.pgtables.list));
++ page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
++ while ( iommu_free_pgtables(dom_io) == -ERESTART )
++ /* nothing */;
++ pdev->arch.amd.root_table = NULL;
+ }
+
+ /*
+--- xen/drivers/passthrough/amd/pci_amd_iommu.c.orig
++++ xen/drivers/passthrough/amd/pci_amd_iommu.c
+@@ -122,6 +122,8 @@ static int __must_check amd_iommu_setup_
+ u8 bus = pdev->bus;
+ const struct domain_iommu *hd = dom_iommu(domain);
+ const struct ivrs_mappings *ivrs_dev;
++ const struct page_info *root_pg;
++ domid_t domid;
+
+ BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
+
+@@ -141,14 +143,25 @@ static int __must_check amd_iommu_setup_
+ dte = &table[req_id];
+ ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+
++ if ( domain != dom_io )
++ {
++ root_pg = hd->arch.amd.root_table;
++ domid = domain->domain_id;
++ }
++ else
++ {
++ root_pg = pdev->arch.amd.root_table;
++ domid = pdev->arch.pseudo_domid;
++ }
++
+ spin_lock_irqsave(&iommu->lock, flags);
+
+ if ( !dte->v || !dte->tv )
+ {
+ /* bind DTE to domain page-tables */
+ rc = amd_iommu_set_root_page_table(
+- dte, page_to_maddr(hd->arch.amd.root_table),
+- domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
++ dte, page_to_maddr(root_pg), domid,
++ hd->arch.amd.paging_mode, sr_flags);
+ if ( rc )
+ {
+ ASSERT(rc < 0);
+@@ -172,7 +185,7 @@ static int __must_check amd_iommu_setup_
+
+ amd_iommu_flush_device(iommu, req_id);
+ }
+- else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
++ else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
+ {
+ /*
+ * Strictly speaking if the device is the only one with this requestor
+@@ -185,8 +198,8 @@ static int __must_check amd_iommu_setup_
+ rc = -EOPNOTSUPP;
+ else
+ rc = amd_iommu_set_root_page_table(
+- dte, page_to_maddr(hd->arch.amd.root_table),
+- domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
++ dte, page_to_maddr(root_pg), domid,
++ hd->arch.amd.paging_mode, sr_flags);
+ if ( rc < 0 )
+ {
+ spin_unlock_irqrestore(&iommu->lock, flags);
+@@ -205,6 +218,7 @@ static int __must_check amd_iommu_setup_
+ * intended anyway.
+ */
+ !pdev->domain->is_dying &&
++ pdev->domain != dom_io &&
+ (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+ pdev->phantom_stride) )
+ printk(" %pp: reassignment may cause %pd data corruption\n",
+@@ -234,9 +248,8 @@ static int __must_check amd_iommu_setup_
+ AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+ "root table = %#"PRIx64", "
+ "domain = %d, paging mode = %d\n",
+- req_id, pdev->type,
+- page_to_maddr(hd->arch.amd.root_table),
+- domain->domain_id, hd->arch.amd.paging_mode);
++ req_id, pdev->type, page_to_maddr(root_pg),
++ domid, hd->arch.amd.paging_mode);
+
+ ASSERT(pcidevs_locked());
+
+@@ -305,7 +318,7 @@ int amd_iommu_alloc_root(struct domain *
+ {
+ struct domain_iommu *hd = dom_iommu(d);
+
+- if ( unlikely(!hd->arch.amd.root_table) )
++ if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
+ {
+ hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
+ if ( !hd->arch.amd.root_table )
+@@ -396,7 +409,7 @@ static void amd_iommu_disable_domain_dev
+
+ AMD_IOMMU_DEBUG("Disable: device id = %#x, "
+ "domain = %d, paging mode = %d\n",
+- req_id, domain->domain_id,
++ req_id, dte->domain_id,
+ dom_iommu(domain)->arch.amd.paging_mode);
+ }
+ spin_unlock_irqrestore(&iommu->lock, flags);
+@@ -608,6 +621,8 @@ static int amd_iommu_remove_device(u8 de
+
+ amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
+
++ amd_iommu_quarantine_teardown(pdev);
++
+ iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+ pdev->arch.pseudo_domid = DOMID_INVALID;
+
+--- xen/drivers/passthrough/iommu.c.orig
++++ xen/drivers/passthrough/iommu.c
+@@ -424,21 +424,21 @@ int iommu_iotlb_flush_all(struct domain
+ return rc;
+ }
+
+-static int __init iommu_quarantine_init(void)
++int iommu_quarantine_dev_init(device_t *dev)
+ {
+ const struct domain_iommu *hd = dom_iommu(dom_io);
+- int rc;
+
+- dom_io->options |= XEN_DOMCTL_CDF_iommu;
++ if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
++ return 0;
+
+- rc = iommu_domain_init(dom_io, 0);
+- if ( rc )
+- return rc;
++ return iommu_call(hd->platform_ops, quarantine_init, dev);
++}
+
+- if ( !hd->platform_ops->quarantine_init )
+- return 0;
++static int __init iommu_quarantine_init(void)
++{
++ dom_io->options |= XEN_DOMCTL_CDF_iommu;
+
+- return hd->platform_ops->quarantine_init(dom_io);
++ return iommu_domain_init(dom_io, 0);
+ }
+
+ int __init iommu_setup(void)
+--- xen/drivers/passthrough/pci.c.orig
++++ xen/drivers/passthrough/pci.c
+@@ -858,9 +858,16 @@ static int deassign_device(struct domain
+ return -ENODEV;
+
+ /* De-assignment from dom_io should de-quarantine the device */
+- target = ((pdev->quarantine || iommu_quarantine) &&
+- pdev->domain != dom_io) ?
+- dom_io : hardware_domain;
++ if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
++ {
++ ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
++ if ( ret )
++ return ret;
++
++ target = dom_io;
++ }
++ else
++ target = hardware_domain;
+
+ while ( pdev->phantom_stride )
+ {
+@@ -1441,6 +1448,13 @@ static int assign_device(struct domain *
+ msixtbl_init(d);
+ }
+
++ if ( pdev->domain != dom_io )
++ {
++ rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
++ if ( rc )
++ goto done;
++ }
++
+ pdev->fault.count = 0;
+
+ if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -43,6 +43,12 @@
+ #include "vtd.h"
+ #include "../ats.h"
+
++#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
++ : (pdev)->arch.pseudo_domid)
++#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
++ ? dom_iommu(d)->arch.vtd.pgd_maddr \
++ : (pdev)->arch.vtd.pgd_maddr)
++
+ /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
+ bool __read_mostly untrusted_msi;
+
+@@ -85,13 +91,18 @@ static int get_iommu_did(domid_t domid,
+
+ #define DID_FIELD_WIDTH 16
+ #define DID_HIGH_OFFSET 8
++
++/*
++ * This function may have "context" passed as NULL, to merely obtain a DID
++ * for "domid".
++ */
+ static int context_set_domain_id(struct context_entry *context,
+ domid_t domid, struct vtd_iommu *iommu)
+ {
+ unsigned long nr_dom, i;
+ int found = 0;
+
+- ASSERT(spin_is_locked(&iommu->lock));
++ ASSERT(pcidevs_locked());
+
+ nr_dom = cap_ndoms(iommu->cap);
+ i = find_first_bit(iommu->domid_bitmap, nr_dom);
+@@ -117,8 +128,13 @@ static int context_set_domain_id(struct
+ }
+
+ set_bit(i, iommu->domid_bitmap);
+- context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+- context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
++
++ if ( context )
++ {
++ context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
++ context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
++ }
++
+ return 0;
+ }
+
+@@ -168,8 +184,12 @@ static void check_cleanup_domid_map(stru
+ const struct pci_dev *exclude,
+ struct vtd_iommu *iommu)
+ {
+- bool found = any_pdev_behind_iommu(d, exclude, iommu);
++ bool found;
++
++ if ( d == dom_io )
++ return;
+
++ found = any_pdev_behind_iommu(d, exclude, iommu);
+ /*
+ * Hidden devices are associated with DomXEN but usable by the hardware
+ * domain. Hence they need considering here as well.
+@@ -1414,7 +1434,7 @@ int domain_context_mapping_one(
+ domid = iommu->domid_map[prev_did];
+ if ( domid < DOMID_FIRST_RESERVED )
+ prev_dom = rcu_lock_domain_by_id(domid);
+- else if ( domid == DOMID_IO )
++ else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
+ prev_dom = rcu_lock_domain(dom_io);
+ if ( !prev_dom )
+ {
+@@ -1570,15 +1590,12 @@ int domain_context_mapping_one(
+ {
+ if ( !prev_dom )
+ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+- domain->domain_id);
++ DEVICE_DOMID(domain, pdev));
+ else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+- {
+- hd = dom_iommu(prev_dom);
+ ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+- domain->domain_id,
+- hd->arch.vtd.pgd_maddr,
++ DEVICE_DOMID(prev_dom, pdev),
++ DEVICE_PGTABLE(prev_dom, pdev),
+ mode & MAP_WITH_RMRR) < 0;
+- }
+ else
+ ret = 1;
+
+@@ -1600,7 +1617,7 @@ static int domain_context_mapping(struct
+ {
+ struct acpi_drhd_unit *drhd;
+ const struct acpi_rmrr_unit *rmrr;
+- paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
++ paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
+ domid_t orig_domid = pdev->arch.pseudo_domid;
+ int ret = 0;
+ unsigned int i, mode = 0;
+@@ -1633,7 +1650,7 @@ static int domain_context_mapping(struct
+ break;
+ }
+
+- if ( domain != pdev->domain )
++ if ( domain != pdev->domain && pdev->domain != dom_io )
+ {
+ if ( pdev->domain->is_dying )
+ mode |= MAP_OWNER_DYING;
+@@ -1672,8 +1689,8 @@ static int domain_context_mapping(struct
+ if ( iommu_debug )
+ printk(VTDPREFIX "%pd:PCIe: map %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+- ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev, domain->domain_id, pgd_maddr,
++ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
++ DEVICE_DOMID(domain, pdev), pgd_maddr,
+ mode);
+ if ( ret > 0 )
+ ret = 0;
+@@ -1696,8 +1713,8 @@ static int domain_context_mapping(struct
+ domain, &PCI_SBDF3(seg, bus, devfn));
+
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- pdev, domain->domain_id, pgd_maddr,
+- mode);
++ pdev, DEVICE_DOMID(domain, pdev),
++ pgd_maddr, mode);
+ if ( ret < 0 )
+ break;
+ prev_present = ret;
+@@ -1725,8 +1742,8 @@ static int domain_context_mapping(struct
+ */
+ if ( ret >= 0 )
+ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+- NULL, domain->domain_id, pgd_maddr,
+- mode);
++ NULL, DEVICE_DOMID(domain, pdev),
++ pgd_maddr, mode);
+
+ /*
+ * Devices behind PCIe-to-PCI/PCIx bridge may generate different
+@@ -1741,8 +1758,8 @@ static int domain_context_mapping(struct
+ if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
+ (secbus != pdev->bus || pdev->devfn != 0) )
+ ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
+- NULL, domain->domain_id, pgd_maddr,
+- mode);
++ NULL, DEVICE_DOMID(domain, pdev),
++ pgd_maddr, mode);
+
+ if ( ret )
+ {
+@@ -1889,7 +1906,7 @@ static const struct acpi_drhd_unit *doma
+ printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+- domain->domain_id);
++ DEVICE_DOMID(domain, pdev));
+ if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
+ disable_ats_device(pdev);
+
+@@ -1900,7 +1917,7 @@ static const struct acpi_drhd_unit *doma
+ printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
+ domain, &PCI_SBDF3(seg, bus, devfn));
+ ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+- domain->domain_id);
++ DEVICE_DOMID(domain, pdev));
+ if ( ret )
+ break;
+
+@@ -1923,18 +1940,12 @@ static const struct acpi_drhd_unit *doma
+ break;
+ }
+
++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
++ DEVICE_DOMID(domain, pdev));
+ /* PCIe to PCI/PCIx bridge */
+- if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+- {
+- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+- domain->domain_id);
+- if ( !ret )
+- ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+- domain->domain_id);
+- }
+- else /* Legacy PCI bridge */
+- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+- domain->domain_id);
++ if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
++ ret = domain_context_unmap_one(domain, iommu, secbus, 0,
++ DEVICE_DOMID(domain, pdev));
+
+ break;
+
+@@ -1977,6 +1988,26 @@ static void iommu_domain_teardown(struct
+ cleanup_domid_map(d->domain_id, drhd->iommu);
+ }
+
++static void quarantine_teardown(struct pci_dev *pdev,
++ const struct acpi_drhd_unit *drhd)
++{
++ struct domain_iommu *hd = dom_iommu(dom_io);
++
++ ASSERT(pcidevs_locked());
++
++ if ( !pdev->arch.vtd.pgd_maddr )
++ return;
++
++ ASSERT(page_list_empty(&hd->arch.pgtables.list));
++ page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
++ while ( iommu_free_pgtables(dom_io) == -ERESTART )
++ /* nothing */;
++ pdev->arch.vtd.pgd_maddr = 0;
++
++ if ( drhd )
++ cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
++}
++
+ static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
+ mfn_t mfn, unsigned int flags,
+ unsigned int *flush_flags)
+@@ -2201,6 +2232,8 @@ static int intel_iommu_remove_device(u8
+ rmrr->end_address, 0);
+ }
+
++ quarantine_teardown(pdev, drhd);
++
+ if ( drhd )
+ {
+ iommu_free_domid(pdev->arch.pseudo_domid,
+@@ -2896,69 +2929,135 @@ static void vtd_dump_page_tables(struct
+ agaw_to_level(hd->arch.vtd.agaw), 0, 0);
+ }
+
+-static int __init intel_iommu_quarantine_init(struct domain *d)
++static int fill_qpt(struct dma_pte *this, unsigned int level,
++ struct page_info *pgs[6])
+ {
+- struct domain_iommu *hd = dom_iommu(d);
++ struct domain_iommu *hd = dom_iommu(dom_io);
++ unsigned int i;
++ int rc = 0;
++
++ for ( i = 0; !rc && i < PTE_NUM; ++i )
++ {
++ struct dma_pte *pte = &this[i], *next;
++
++ if ( !dma_pte_present(*pte) )
++ {
++ if ( !pgs[level] )
++ {
++ /*
++ * The pgtable allocator is fine for the leaf page, as well as
++ * page table pages, and the resulting allocations are always
++ * zeroed.
++ */
++ pgs[level] = iommu_alloc_pgtable(hd);
++ if ( !pgs[level] )
++ {
++ rc = -ENOMEM;
++ break;
++ }
++
++ if ( level )
++ {
++ next = map_vtd_domain_page(page_to_maddr(pgs[level]));
++ rc = fill_qpt(next, level - 1, pgs);
++ unmap_vtd_domain_page(next);
++ }
++ }
++
++ dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
++ dma_set_pte_readable(*pte);
++ dma_set_pte_writable(*pte);
++ }
++ else if ( level && !dma_pte_superpage(*pte) )
++ {
++ next = map_vtd_domain_page(dma_pte_addr(*pte));
++ rc = fill_qpt(next, level - 1, pgs);
++ unmap_vtd_domain_page(next);
++ }
++ }
++
++ return rc;
++}
++
++static int intel_iommu_quarantine_init(struct pci_dev *pdev)
++{
++ struct domain_iommu *hd = dom_iommu(dom_io);
+ struct page_info *pg;
+- struct dma_pte *parent;
+ unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
+ unsigned int level = agaw_to_level(agaw);
+- int rc = 0;
++ const struct acpi_drhd_unit *drhd;
++ const struct acpi_rmrr_unit *rmrr;
++ unsigned int i, bdf;
++ bool rmrr_found = false;
++ int rc;
+
+- spin_lock(&hd->arch.mapping_lock);
++ ASSERT(pcidevs_locked());
++ ASSERT(!hd->arch.vtd.pgd_maddr);
++ ASSERT(page_list_empty(&hd->arch.pgtables.list));
+
+- if ( hd->arch.vtd.pgd_maddr )
++ if ( pdev->arch.vtd.pgd_maddr )
+ {
+- ASSERT_UNREACHABLE();
+- goto out;
++ clear_domain_page(pdev->arch.leaf_mfn);
++ return 0;
+ }
+
+- pg = iommu_alloc_pgtable(hd);
++ drhd = acpi_find_matched_drhd_unit(pdev);
++ if ( !drhd )
++ return -ENODEV;
+
+- rc = -ENOMEM;
++ pg = iommu_alloc_pgtable(hd);
+ if ( !pg )
+- goto out;
++ return -ENOMEM;
+
++ rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
++
++ /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+ hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
+
+- parent = map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
+- while ( level )
++ for_each_rmrr_device ( rmrr, bdf, i )
+ {
+- uint64_t maddr;
+- unsigned int offset;
+-
+- /*
+- * The pgtable allocator is fine for the leaf page, as well as
+- * page table pages, and the resulting allocations are always
+- * zeroed.
+- */
+- pg = iommu_alloc_pgtable(hd);
+-
+- if ( !pg )
+- goto out;
++ if ( rc )
++ break;
+
+- maddr = page_to_maddr(pg);
+- for ( offset = 0; offset < PTE_NUM; offset++ )
++ if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
+ {
+- struct dma_pte *pte = &parent[offset];
++ rmrr_found = true;
+
+- dma_set_pte_addr(*pte, maddr);
+- dma_set_pte_readable(*pte);
++ rc = iommu_identity_mapping(dom_io, p2m_access_rw,
++ rmrr->base_address, rmrr->end_address,
++ 0);
++ if ( rc )
++ printk(XENLOG_ERR VTDPREFIX
++ "%pp: RMRR quarantine mapping failed\n",
++ &pdev->sbdf);
+ }
+- iommu_sync_cache(parent, PAGE_SIZE);
++ }
+
+- unmap_vtd_domain_page(parent);
+- parent = map_vtd_domain_page(maddr);
+- level--;
++ iommu_identity_map_teardown(dom_io);
++ hd->arch.vtd.pgd_maddr = 0;
++ pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
++
++ if ( !rc )
++ {
++ struct dma_pte *root;
++ struct page_info *pgs[6] = {};
++
++ spin_lock(&hd->arch.mapping_lock);
++
++ root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
++ rc = fill_qpt(root, level - 1, pgs);
++ unmap_vtd_domain_page(root);
++
++ pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
++
++ spin_unlock(&hd->arch.mapping_lock);
+ }
+- unmap_vtd_domain_page(parent);
+
+- rc = 0;
++ page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+
+- out:
+- spin_unlock(&hd->arch.mapping_lock);
++ if ( rc )
++ quarantine_teardown(pdev, drhd);
+
+- /* Pages may be leaked in failure case */
+ return rc;
+ }
+
+--- xen/drivers/passthrough/vtd/iommu.h.orig
++++ xen/drivers/passthrough/vtd/iommu.h
+@@ -509,7 +509,7 @@ struct vtd_iommu {
+ u32 nr_pt_levels;
+ u64 cap;
+ u64 ecap;
+- spinlock_t lock; /* protect context, domain ids */
++ spinlock_t lock; /* protect context */
+ spinlock_t register_lock; /* protect iommu register handling */
+ u64 root_maddr; /* root entry machine address */
+ nodeid_t node;
+--- xen/include/xen/iommu.h.orig
++++ xen/include/xen/iommu.h
+@@ -234,7 +234,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
+ struct iommu_ops {
+ int (*init)(struct domain *d);
+ void (*hwdom_init)(struct domain *d);
+- int (*quarantine_init)(struct domain *d);
++ int (*quarantine_init)(device_t *dev);
+ int (*add_device)(u8 devfn, device_t *dev);
+ int (*enable_device)(device_t *dev);
+ int (*remove_device)(u8 devfn, device_t *dev);
+@@ -352,6 +352,7 @@ int __must_check iommu_suspend(void);
+ void iommu_resume(void);
+ void iommu_crash_shutdown(void);
+ int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
++int iommu_quarantine_dev_init(device_t *dev);
+
+ #ifdef CONFIG_HAS_PCI
+ int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA401
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA401:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA401 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,363 @@
+$NetBSD: patch-XSA401,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/pv: Clean up _get_page_type()
+
+Various fixes for clarity, ahead of making complicated changes.
+
+ * Split the overflow check out of the if/else chain for type handling, as
+ it's somewhat unrelated.
+ * Comment the main if/else chain to explain what is going on. Adjust one
+ ASSERT() and state the bit layout for validate-locked and partial states.
+ * Correct the comment about TLB flushing, as it's backwards. The problem
+ case is when writeable mappings are retained to a page becoming read-only,
+ as it allows the guest to bypass Xen's safety checks for updates.
+ * Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
+ valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
+ all reads as explicitly volatile. The only thing preventing the validated
+ wait-loop being infinite is the compiler barrier hidden in cpu_relax().
+ * Replace one page_get_owner(page) with the already-calculated 'd' in
+ scope.
+
+No functional change.
+
+This is part of XSA-401 / CVE-2022-26362.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Signed-off-by: George Dunlap <george.dunlap%eu.citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: George Dunlap <george.dunlap%citrix.com@localhost>
+
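A stand-alone C11-atomics analogue of the read / compute / compare-exchange loop that _get_page_type() is built around, illustrating two of the points above: 'y' lives only inside the loop, and every (re)read of the shared word is explicit, with a failed exchange refreshing it. The masks and the helper name are invented for illustration and do not match Xen's PGT_* layout.

    #include <stdatomic.h>
    #include <stdbool.h>

    /*
     * Simplified typeref acquisition: take a reference of 'want_type' unless
     * live references of another type exist.  The real _get_page_type() has
     * further states (validation, PGT_partial); this shows only the loop shape.
     */
    static bool try_bump(atomic_ulong *type_info, unsigned long want_type,
                         unsigned long type_mask, unsigned long count_mask)
    {
        for (unsigned long y = atomic_load(type_info); ; ) {
            unsigned long x = y, nx = x + 1;

            if (!(nx & count_mask))
                return false;                        /* refcount would overflow */
            if ((x & count_mask) && (x & type_mask) != want_type)
                return false;                        /* live refs of another type */
            if (!(x & count_mask))
                nx = (nx & ~type_mask) | want_type;  /* first ref (re)sets the type */

            /* On failure 'y' is refreshed with the current value and we retry. */
            if (atomic_compare_exchange_weak(type_info, &y, nx))
                return true;
        }
    }

    int main(void)
    {
        atomic_ulong ti = 0;

        /* Two refs of type 0x1000 succeed, a ref of type 0x2000 then fails. */
        return !(try_bump(&ti, 0x1000, 0xf000, 0x0fff) &&
                 try_bump(&ti, 0x1000, 0xf000, 0x0fff) &&
                 !try_bump(&ti, 0x2000, 0xf000, 0x0fff));
    }
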
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index 796faca64103..ddd32f88c798 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -2935,16 +2935,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
+ static int _get_page_type(struct page_info *page, unsigned long type,
+ bool preemptible)
+ {
+- unsigned long nx, x, y = page->u.inuse.type_info;
++ unsigned long nx, x;
+ int rc = 0;
+
+ ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
+ ASSERT(!in_irq());
+
+- for ( ; ; )
++ for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; )
+ {
+ x = y;
+ nx = x + 1;
++
+ if ( unlikely((nx & PGT_count_mask) == 0) )
+ {
+ gdprintk(XENLOG_WARNING,
+@@ -2952,8 +2953,15 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ mfn_x(page_to_mfn(page)));
+ return -EINVAL;
+ }
+- else if ( unlikely((x & PGT_count_mask) == 0) )
++
++ if ( unlikely((x & PGT_count_mask) == 0) )
+ {
++ /*
++ * Typeref 0 -> 1.
++ *
++ * Type changes are permitted when the typeref is 0. If the type
++ * actually changes, the page needs re-validating.
++ */
+ struct domain *d = page_get_owner(page);
+
+ if ( d && shadow_mode_enabled(d) )
+@@ -2964,8 +2972,8 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ {
+ /*
+ * On type change we check to flush stale TLB entries. It is
+- * vital that no other CPUs are left with mappings of a frame
+- * which is about to become writeable to the guest.
++ * vital that no other CPUs are left with writeable mappings
++ * to a frame which is intending to become pgtable/segdesc.
+ */
+ cpumask_t *mask = this_cpu(scratch_cpumask);
+
+@@ -2977,7 +2985,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+
+ if ( unlikely(!cpumask_empty(mask)) &&
+ /* Shadow mode: track only writable pages. */
+- (!shadow_mode_enabled(page_get_owner(page)) ||
++ (!shadow_mode_enabled(d) ||
+ ((nx & PGT_type_mask) == PGT_writable_page)) )
+ {
+ perfc_incr(need_flush_tlb_flush);
+@@ -3008,7 +3016,14 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ }
+ else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
+ {
+- /* Don't log failure if it could be a recursive-mapping attempt. */
++ /*
++ * else, we're trying to take a new reference, of the wrong type.
++ *
++ * This (being able to prohibit use of the wrong type) is what the
++ * typeref system exists for, but skip printing the failure if it
++ * looks like a recursive mapping, as subsequent logic might
++ * ultimately permit the attempt.
++ */
+ if ( ((x & PGT_type_mask) == PGT_l2_page_table) &&
+ (type == PGT_l1_page_table) )
+ return -EINVAL;
+@@ -3027,18 +3042,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ }
+ else if ( unlikely(!(x & PGT_validated)) )
+ {
++ /*
++ * else, the count is non-zero, and we're grabbing the right type;
++ * but the page hasn't been validated yet.
++ *
++ * The page is in one of two states (depending on PGT_partial),
++ * and should have exactly one reference.
++ */
++ ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1));
++
+ if ( !(x & PGT_partial) )
+ {
+- /* Someone else is updating validation of this page. Wait... */
++ /*
++ * The page has been left in the "validate locked" state
++ * (i.e. PGT_[type] | 1) which means that a concurrent caller
++ * of _get_page_type() is in the middle of validation.
++ *
++ * Spin waiting for the concurrent user to complete (partial
++ * or fully validated), then restart our attempt to acquire a
++ * type reference.
++ */
+ do {
+ if ( preemptible && hypercall_preempt_check() )
+ return -EINTR;
+ cpu_relax();
+- } while ( (y = page->u.inuse.type_info) == x );
++ } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x );
+ continue;
+ }
+- /* Type ref count was left at 1 when PGT_partial got set. */
+- ASSERT((x & PGT_count_mask) == 1);
++
++ /*
++ * The page has been left in the "partial" state
++ * (i.e., PGT_[type] | PGT_partial | 1).
++ *
++ * Rather than bumping the type count, we need to try to grab the
++ * validation lock; if we succeed, we need to validate the page,
++ * then drop the general ref associated with the PGT_partial bit.
++ *
++ * We grab the validation lock by setting nx to (PGT_[type] | 1)
++ * (i.e., non-zero type count, neither PGT_validated nor
++ * PGT_partial set).
++ */
+ nx = x & ~PGT_partial;
+ }
+
+@@ -3087,6 +3130,13 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ }
+
+ out:
++ /*
++ * Did we drop the PGT_partial bit when acquiring the typeref? If so,
++ * drop the general reference that went along with it.
++ *
++ * N.B. validate_page() may have re-set PGT_partial, not reflected in
++ * nx, but will have taken an extra ref when doing so.
++ */
+ if ( (x & PGT_partial) && !(nx & PGT_partial) )
+ put_page(page);
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
+
+_get_page_type() suffers from a race condition where it incorrectly assumes
+that because 'x' was read and a subsequent cmpxchg() succeeds, the type
+cannot have changed in-between. Consider:
+
+CPU A:
+ 1. Creates an L2e referencing pg
+ `-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
+ 2. Issues flush_tlb_mask()
+CPU B:
+ 3. Creates a writeable mapping of pg
+ `-> _get_page_type(pg, PGT_writable_page), count increases to 1
+ 4. Writes into new mapping, creating a TLB entry for pg
+ 5. Removes the writeable mapping of pg
+ `-> _put_page_type(pg), count goes back down to 0
+CPU A:
+ 6. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
+
+CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
+suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
+until after the guest is prohibited from creating new writeable mappings,
+which is after step 6.
+
+Defer all safety actions until after the cmpxchg() has successfully taken the
+intended typeref, because that is what prevents concurrent users from using
+the old type.
+
+Also remove the early validation for writeable and shared pages. This removes
+race conditions where one half of a parallel mapping attempt can return
+successfully before:
+ * The IOMMU pagetables are in sync with the new page type
+ * Writeable mappings to shared pages have been torn down
+
+This is part of XSA-401 / CVE-2022-26362.
+
+Reported-by: Jann Horn <jannh%google.com@localhost>
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: George Dunlap <george.dunlap%citrix.com@localhost>
+
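The essence of the fix, reduced to a hedged sketch (illustrative names, not
the real _get_page_type(): 'typeinfo' stands in for page->u.inuse.type_info
and flush_other_cpus() for flush_tlb_mask()), is that the safety action has
to move to after the successful cmpxchg():

#include <errno.h>

extern void flush_other_cpus(void);

/* Racy shape: the flush runs while other CPUs can still take and drop
 * writeable references, so a writeable TLB entry created in that window
 * survives even though the later cmpxchg() succeeds. */
static int take_type_racy(unsigned long *typeinfo, unsigned long type)
{
    unsigned long x = *typeinfo, nx = type | 1;

    flush_other_cpus();                                   /* too early */
    return __sync_bool_compare_and_swap(typeinfo, x, nx) ? 0 : -EAGAIN;
}

/* Fixed shape: only once the cmpxchg() has published the new type can no
 * new writeable references appear, so the flush covers every mapping that
 * could still exist. */
static int take_type_fixed(unsigned long *typeinfo, unsigned long type)
{
    unsigned long x = *typeinfo, nx = type | 1;

    if ( !__sync_bool_compare_and_swap(typeinfo, x, nx) )
        return -EAGAIN;

    flush_other_cpus();                                   /* deferred */
    return 0;
}
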
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index ddd32f88c798..1693b580b152 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -2962,56 +2962,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ * Type changes are permitted when the typeref is 0. If the type
+ * actually changes, the page needs re-validating.
+ */
+- struct domain *d = page_get_owner(page);
+-
+- if ( d && shadow_mode_enabled(d) )
+- shadow_prepare_page_type_change(d, page, type);
+
+ ASSERT(!(x & PGT_pae_xen_l2));
+ if ( (x & PGT_type_mask) != type )
+ {
+- /*
+- * On type change we check to flush stale TLB entries. It is
+- * vital that no other CPUs are left with writeable mappings
+- * to a frame which is intending to become pgtable/segdesc.
+- */
+- cpumask_t *mask = this_cpu(scratch_cpumask);
+-
+- BUG_ON(in_irq());
+- cpumask_copy(mask, d->dirty_cpumask);
+-
+- /* Don't flush if the timestamp is old enough */
+- tlbflush_filter(mask, page->tlbflush_timestamp);
+-
+- if ( unlikely(!cpumask_empty(mask)) &&
+- /* Shadow mode: track only writable pages. */
+- (!shadow_mode_enabled(d) ||
+- ((nx & PGT_type_mask) == PGT_writable_page)) )
+- {
+- perfc_incr(need_flush_tlb_flush);
+- /*
+- * If page was a page table make sure the flush is
+- * performed using an IPI in order to avoid changing the
+- * type of a page table page under the feet of
+- * spurious_page_fault().
+- */
+- flush_mask(mask,
+- (x & PGT_type_mask) &&
+- (x & PGT_type_mask) <= PGT_root_page_table
+- ? FLUSH_TLB | FLUSH_FORCE_IPI
+- : FLUSH_TLB);
+- }
+-
+- /* We lose existing type and validity. */
+ nx &= ~(PGT_type_mask | PGT_validated);
+ nx |= type;
+-
+- /*
+- * No special validation needed for writable pages.
+- * Page tables and GDT/LDT need to be scanned for validity.
+- */
+- if ( type == PGT_writable_page || type == PGT_shared_page )
+- nx |= PGT_validated;
+ }
+ }
+ else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
+@@ -3092,6 +3048,56 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ return -EINTR;
+ }
+
++ /*
++ * One typeref has been taken and is now globally visible.
++ *
++ * The page is either in the "validate locked" state (PGT_[type] | 1) or
++ * fully validated (PGT_[type] | PGT_validated | >0).
++ */
++
++ if ( unlikely((x & PGT_count_mask) == 0) )
++ {
++ struct domain *d = page_get_owner(page);
++
++ if ( d && shadow_mode_enabled(d) )
++ shadow_prepare_page_type_change(d, page, type);
++
++ if ( (x & PGT_type_mask) != type )
++ {
++ /*
++ * On type change we check to flush stale TLB entries. It is
++ * vital that no other CPUs are left with writeable mappings
++ * to a frame which is intending to become pgtable/segdesc.
++ */
++ cpumask_t *mask = this_cpu(scratch_cpumask);
++
++ BUG_ON(in_irq());
++ cpumask_copy(mask, d->dirty_cpumask);
++
++ /* Don't flush if the timestamp is old enough */
++ tlbflush_filter(mask, page->tlbflush_timestamp);
++
++ if ( unlikely(!cpumask_empty(mask)) &&
++ /* Shadow mode: track only writable pages. */
++ (!shadow_mode_enabled(d) ||
++ ((nx & PGT_type_mask) == PGT_writable_page)) )
++ {
++ perfc_incr(need_flush_tlb_flush);
++ /*
++ * If page was a page table make sure the flush is
++ * performed using an IPI in order to avoid changing the
++ * type of a page table page under the feet of
++ * spurious_page_fault().
++ */
++ flush_mask(mask,
++ (x & PGT_type_mask) &&
++ (x & PGT_type_mask) <= PGT_root_page_table
++ ? FLUSH_TLB | FLUSH_FORCE_IPI
++ : FLUSH_TLB);
++ }
++ }
++ }
++
+ if ( unlikely(((x & PGT_type_mask) == PGT_writable_page) !=
+ (type == PGT_writable_page)) )
+ {
+@@ -3120,13 +3126,25 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+
+ if ( unlikely(!(nx & PGT_validated)) )
+ {
+- if ( !(x & PGT_partial) )
++ /*
++ * No special validation needed for writable or shared pages. Page
++ * tables and GDT/LDT need to have their contents audited.
++ *
++ * per validate_page(), non-atomic updates are fine here.
++ */
++ if ( type == PGT_writable_page || type == PGT_shared_page )
++ page->u.inuse.type_info |= PGT_validated;
++ else
+ {
+- page->nr_validated_ptes = 0;
+- page->partial_flags = 0;
+- page->linear_pt_count = 0;
++ if ( !(x & PGT_partial) )
++ {
++ page->nr_validated_ptes = 0;
++ page->partial_flags = 0;
++ page->linear_pt_count = 0;
++ }
++
++ rc = validate_page(page, type, preemptible);
+ }
+- rc = validate_page(page, type, preemptible);
+ }
+
+ out:
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA402
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA402:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA402 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,773 @@
+$NetBSD: patch-XSA402,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/page: Introduce _PAGE_* constants for memory types
+
+... rather than opencoding the PAT/PCD/PWT attributes in __PAGE_HYPERVISOR_*
+constants. These are going to be needed by forthcoming logic.
+
+No functional change.
+
+This is part of XSA-402.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
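For background, PAT, PCD and PWT in a 4K leaf PTE form a 3-bit index into the
8-entry IA32_PAT MSR, and the new constants simply give names to those index
encodings under Xen's PAT layout. A small hedged sketch of the decoding (a
standalone example, not Xen code; the bit positions are the architectural
ones):

#include <stdint.h>
#include <stdio.h>

#define _PAGE_PWT (1u << 3)   /* PTE bit 3 */
#define _PAGE_PCD (1u << 4)   /* PTE bit 4 */
#define _PAGE_PAT (1u << 7)   /* PTE bit 7 in 4K leaf entries */

/* PAT entry index = 4*PAT + 2*PCD + PWT; each entry of IA32_PAT holds one
 * memory type (WB, WT, UC-, UC, WC or WP). */
static unsigned int pat_index(uint64_t pte)
{
    return (!!(pte & _PAGE_PAT) << 2) |
           (!!(pte & _PAGE_PCD) << 1) |
            !!(pte & _PAGE_PWT);
}

int main(void)
{
    uint64_t pte = _PAGE_PCD | _PAGE_PWT;   /* the combination the patch names _PAGE_UC */

    printf("PAT index %u\n", pat_index(pte));   /* prints 3 */
    return 0;
}
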
+diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h
+index 4c7f2cb70c69..534bc1f403b3 100644
+--- xen/include/asm-x86/page.h.orig
++++ xen/include/asm-x86/page.h
+@@ -336,6 +336,14 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
+
+ #define PAGE_CACHE_ATTRS (_PAGE_PAT | _PAGE_PCD | _PAGE_PWT)
+
++/* Memory types, encoded under Xen's choice of MSR_PAT. */
++#define _PAGE_WB ( 0)
++#define _PAGE_WT ( _PAGE_PWT)
++#define _PAGE_UCM ( _PAGE_PCD )
++#define _PAGE_UC ( _PAGE_PCD | _PAGE_PWT)
++#define _PAGE_WC (_PAGE_PAT )
++#define _PAGE_WP (_PAGE_PAT | _PAGE_PWT)
++
+ /*
+ * Debug option: Ensure that granted mappings are not implicitly unmapped.
+ * WARNING: This will need to be disabled to run OSes that use the spare PTE
+@@ -354,8 +362,8 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t);
+ #define __PAGE_HYPERVISOR_RX (_PAGE_PRESENT | _PAGE_ACCESSED)
+ #define __PAGE_HYPERVISOR (__PAGE_HYPERVISOR_RX | \
+ _PAGE_DIRTY | _PAGE_RW)
+-#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
+-#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
++#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_UCM)
++#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_UC)
+ #define __PAGE_HYPERVISOR_SHSTK (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
+
+ #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86: Don't change the cacheability of the directmap
+
+Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings
+in response to guest mapping requests") attempted to keep the cacheability
+consistent between different mappings of the same page.
+
+The reason wasn't described in the changelog, but it is understood to be in
+regards to a concern over machine check exceptions, owing to errata when using
+mixed cacheabilities. It did this primarily by updating Xen's mapping of the
+page in the direct map when the guest mapped a page with reduced cacheability.
+
+Unfortunately, the logic didn't actually prevent mixed cacheability from
+occurring:
+ * A guest could map a page normally, and then map the same page with
+ different cacheability; nothing prevented this.
+ * The cacheability of the directmap was always latest-takes-precedence in
+ terms of guest requests.
+ * Grant-mapped frames with lesser cacheability didn't adjust the page's
+ cacheattr settings.
+ * The map_domain_page() function still unconditionally created WB mappings,
+ irrespective of the page's cacheattr settings.
+
+Additionally, update_xen_mappings() had a bug where the alias calculation was
+wrong for mfn's which were .init content, which should have been treated as
+fully guest pages, not Xen pages.
+
+Worse yet, the logic introduced a vulnerability whereby necessary
+pagetable/segdesc adjustments made by Xen in the validation logic could become
+non-coherent between the cache and main memory. The CPU could subsequently
+operate on the stale value in the cache, rather than the safe value in main
+memory.
+
+The directmap contains primarily mappings of RAM. PAT/MTRR conflict
+resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
+cacheability resolves to being coherent. The special case is WC mappings,
+which are non-coherent against MTRR=WB regions (except for fully-coherent
+CPUs).
+
+Xen must not have any WC cacheability in the directmap, to prevent Xen's
+actions from creating non-coherency. (Guest actions creating non-coherency is
+dealt with in subsequent patches.) As all memory types for MTRR=WB ranges
+inter-operate coherently, leave Xen's directmap mappings as WB.
+
+Only PV guests with access to devices can use reduced-cacheability mappings to
+begin with, and they're trusted not to mount DoSs against the system anyway.
+
+Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
+Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
+count. Retain the check in get_page_from_l1e() for special_pages() because a
+guest has no business using reduced cacheability on these.
+
+This reverts changeset 55f97f49b7ce6c3520c555d19caac6cf3f9a5df0
+
+This is CVE-2022-26363, part of XSA-402.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: George Dunlap <george.dunlap%citrix.com@localhost>
+
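To see where the "3 extra bits" come from: PG_shift() counts bit positions
down from the top of count_info, so once the 3-bit cacheattr field is gone the
lowest flag index moves from 10 to 7 and the low-order reference count widens
accordingly. A hedged arithmetic sketch, assuming the header's usual
definition PG_shift(idx) == BITS_PER_LONG - idx:

#include <stdio.h>

#define BITS_PER_LONG   64
#define PG_shift(idx)   (BITS_PER_LONG - (idx))

int main(void)
{
    /* Before: PGC_count_width == PG_shift(10). */
    printf("old count width: %d bits\n", PG_shift(10));   /* 54 */

    /* After:  PGC_count_width == PG_shift(7). */
    printf("new count width: %d bits\n", PG_shift(7));    /* 57 */

    return 0;
}
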
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index 2644b9f0337c..6ce8c19dcecc 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -783,28 +783,6 @@ bool is_iomem_page(mfn_t mfn)
+ return (page_get_owner(page) == dom_io);
+ }
+
+-static int update_xen_mappings(unsigned long mfn, unsigned int cacheattr)
+-{
+- int err = 0;
+- bool alias = mfn >= PFN_DOWN(xen_phys_start) &&
+- mfn < PFN_UP(xen_phys_start + xen_virt_end - XEN_VIRT_START);
+- unsigned long xen_va =
+- XEN_VIRT_START + ((mfn - PFN_DOWN(xen_phys_start)) << PAGE_SHIFT);
+-
+- if ( boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) )
+- return 0;
+-
+- if ( unlikely(alias) && cacheattr )
+- err = map_pages_to_xen(xen_va, _mfn(mfn), 1, 0);
+- if ( !err )
+- err = map_pages_to_xen((unsigned long)mfn_to_virt(mfn), _mfn(mfn), 1,
+- PAGE_HYPERVISOR | cacheattr_to_pte_flags(cacheattr));
+- if ( unlikely(alias) && !cacheattr && !err )
+- err = map_pages_to_xen(xen_va, _mfn(mfn), 1, PAGE_HYPERVISOR);
+-
+- return err;
+-}
+-
+ #ifndef NDEBUG
+ struct mmio_emul_range_ctxt {
+ const struct domain *d;
+@@ -1009,47 +987,14 @@ get_page_from_l1e(
+ goto could_not_pin;
+ }
+
+- if ( pte_flags_to_cacheattr(l1f) !=
+- ((page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base) )
++ if ( (l1f & PAGE_CACHE_ATTRS) != _PAGE_WB && is_special_page(page) )
+ {
+- unsigned long x, nx, y = page->count_info;
+- unsigned long cacheattr = pte_flags_to_cacheattr(l1f);
+- int err;
+-
+- if ( is_special_page(page) )
+- {
+- if ( write )
+- put_page_type(page);
+- put_page(page);
+- gdprintk(XENLOG_WARNING,
+- "Attempt to change cache attributes of Xen heap page\n");
+- return -EACCES;
+- }
+-
+- do {
+- x = y;
+- nx = (x & ~PGC_cacheattr_mask) | (cacheattr << PGC_cacheattr_base);
+- } while ( (y = cmpxchg(&page->count_info, x, nx)) != x );
+-
+- err = update_xen_mappings(mfn, cacheattr);
+- if ( unlikely(err) )
+- {
+- cacheattr = y & PGC_cacheattr_mask;
+- do {
+- x = y;
+- nx = (x & ~PGC_cacheattr_mask) | cacheattr;
+- } while ( (y = cmpxchg(&page->count_info, x, nx)) != x );
+-
+- if ( write )
+- put_page_type(page);
+- put_page(page);
+-
+- gdprintk(XENLOG_WARNING, "Error updating mappings for mfn %" PRI_mfn
+- " (pfn %" PRI_pfn ", from L1 entry %" PRIpte ") for d%d\n",
+- mfn, get_gpfn_from_mfn(mfn),
+- l1e_get_intpte(l1e), l1e_owner->domain_id);
+- return err;
+- }
++ if ( write )
++ put_page_type(page);
++ put_page(page);
++ gdprintk(XENLOG_WARNING,
++ "Attempt to change cache attributes of Xen heap page\n");
++ return -EACCES;
+ }
+
+ return 0;
+@@ -2455,25 +2400,10 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
+ */
+ static int cleanup_page_mappings(struct page_info *page)
+ {
+- unsigned int cacheattr =
+- (page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base;
+ int rc = 0;
+ unsigned long mfn = mfn_x(page_to_mfn(page));
+
+ /*
+- * If we've modified xen mappings as a result of guest cache
+- * attributes, restore them to the "normal" state.
+- */
+- if ( unlikely(cacheattr) )
+- {
+- page->count_info &= ~PGC_cacheattr_mask;
+-
+- BUG_ON(is_special_page(page));
+-
+- rc = update_xen_mappings(mfn, 0);
+- }
+-
+- /*
+ * If this may be in a PV domain's IOMMU, remove it.
+ *
+ * NB that writable xenheap pages have their type set and cleared by
+diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
+index 041c158f03f6..f5b8862b8374 100644
+--- xen/include/asm-x86/mm.h.orig
++++ xen/include/asm-x86/mm.h
+@@ -69,25 +69,22 @@
+ /* Set when is using a page as a page table */
+ #define _PGC_page_table PG_shift(3)
+ #define PGC_page_table PG_mask(1, 3)
+- /* 3-bit PAT/PCD/PWT cache-attribute hint. */
+-#define PGC_cacheattr_base PG_shift(6)
+-#define PGC_cacheattr_mask PG_mask(7, 6)
+ /* Page is broken? */
+-#define _PGC_broken PG_shift(7)
+-#define PGC_broken PG_mask(1, 7)
++#define _PGC_broken PG_shift(4)
++#define PGC_broken PG_mask(1, 4)
+ /* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */
+-#define PGC_state PG_mask(3, 9)
+-#define PGC_state_inuse PG_mask(0, 9)
+-#define PGC_state_offlining PG_mask(1, 9)
+-#define PGC_state_offlined PG_mask(2, 9)
+-#define PGC_state_free PG_mask(3, 9)
++#define PGC_state PG_mask(3, 6)
++#define PGC_state_inuse PG_mask(0, 6)
++#define PGC_state_offlining PG_mask(1, 6)
++#define PGC_state_offlined PG_mask(2, 6)
++#define PGC_state_free PG_mask(3, 6)
+ #define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st)
+ /* Page is not reference counted */
+-#define _PGC_extra PG_shift(10)
+-#define PGC_extra PG_mask(1, 10)
++#define _PGC_extra PG_shift(7)
++#define PGC_extra PG_mask(1, 7)
+
+ /* Count of references to this frame. */
+-#define PGC_count_width PG_shift(10)
++#define PGC_count_width PG_shift(7)
+ #define PGC_count_mask ((1UL<<PGC_count_width)-1)
+
+ /*
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86: Split cache_flush() out of cache_writeback()
+
+Subsequent changes will want a fully flushing version.
+
+Use the new helper rather than opencoding it in flush_area_local(). This
+resolves an outstanding issue where the conditional sfence is on the wrong
+side of the clflushopt loop. clflushopt is ordered with respect to older
+stores, not to younger stores.
+
+Rename gnttab_cache_flush()'s helper to avoid colliding in name.
+grant_table.c can see the prototype from cache.h so the build fails
+otherwise.
+
+This is part of XSA-402.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
+Xen 4.16 and earlier:
+ * Also backport half of c/s 3330013e67396 "VT-d / x86: re-arrange cache
+ syncing" to split cache_writeback() out of the IOMMU logic, but without the
+ associated hooks changes.
+
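As the description notes, CLFLUSHOPT is ordered with respect to older stores
but not younger ones, so the SFENCE has to come after the flush loop rather
than before it. A hedged standalone sketch of the corrected shape using
compiler intrinsics (a fixed 64-byte line size is assumed here; the real code
reads it from CPUID and uses the alternatives framework instead):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Compile with -mclflushopt (or equivalent) for _mm_clflushopt(). */
static void flush_range(const void *addr, size_t size)
{
    const size_t line = 64;
    uintptr_t p = (uintptr_t)addr & ~(line - 1);
    uintptr_t end = (uintptr_t)addr + size;

    for ( ; p < end; p += line )
        _mm_clflushopt((void *)p);

    /* The fence belongs *after* the loop: it is what makes the flushes
     * complete before any younger stores to the range become visible. */
    _mm_sfence();
}
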
+diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
+index 25798df50f54..0c912b8669f8 100644
+--- xen/arch/x86/flushtlb.c.orig
++++ xen/arch/x86/flushtlb.c
+@@ -234,7 +234,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
+ if ( flags & FLUSH_CACHE )
+ {
+ const struct cpuinfo_x86 *c = &current_cpu_data;
+- unsigned long i, sz = 0;
++ unsigned long sz = 0;
+
+ if ( order < (BITS_PER_LONG - PAGE_SHIFT) )
+ sz = 1UL << (order + PAGE_SHIFT);
+@@ -244,13 +244,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
+ c->x86_clflush_size && c->x86_cache_size && sz &&
+ ((sz >> 10) < c->x86_cache_size) )
+ {
+- alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
+- for ( i = 0; i < sz; i += c->x86_clflush_size )
+- alternative_input(".byte " __stringify(NOP_DS_PREFIX) ";"
+- " clflush %0",
+- "data16 clflush %0", /* clflushopt */
+- X86_FEATURE_CLFLUSHOPT,
+- "m" (((const char *)va)[i]));
++ cache_flush(va, sz);
+ flags &= ~FLUSH_CACHE;
+ }
+ else
+@@ -265,6 +259,80 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
+ return flags;
+ }
+
++void cache_flush(const void *addr, unsigned int size)
++{
++ /*
++ * This function may be called before current_cpu_data is established.
++ * Hence a fallback is needed to prevent the loop below becoming infinite.
++ */
++ unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16;
++ const void *end = addr + size;
++
++ addr -= (unsigned long)addr & (clflush_size - 1);
++ for ( ; addr < end; addr += clflush_size )
++ {
++ /*
++ * Note regarding the "ds" prefix use: it's faster to do a clflush
++ * + prefix than a clflush + nop, and hence the prefix is added instead
++ * of letting the alternative framework fill the gap by appending nops.
++ */
++ alternative_io("ds; clflush %[p]",
++ "data16 clflush %[p]", /* clflushopt */
++ X86_FEATURE_CLFLUSHOPT,
++ /* no outputs */,
++ [p] "m" (*(const char *)(addr)));
++ }
++
++ alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
++}
++
++void cache_writeback(const void *addr, unsigned int size)
++{
++ unsigned int clflush_size;
++ const void *end = addr + size;
++
++ /* Fall back to CLFLUSH{,OPT} when CLWB isn't available. */
++ if ( !boot_cpu_has(X86_FEATURE_CLWB) )
++ return cache_flush(addr, size);
++
++ /*
++ * This function may be called before current_cpu_data is established.
++ * Hence a fallback is needed to prevent the loop below becoming infinite.
++ */
++ clflush_size = current_cpu_data.x86_clflush_size ?: 16;
++ addr -= (unsigned long)addr & (clflush_size - 1);
++ for ( ; addr < end; addr += clflush_size )
++ {
++/*
++ * The arguments to a macro must not include preprocessor directives. Doing so
++ * results in undefined behavior, so we have to create some defines here in
++ * order to avoid it.
++ */
++#if defined(HAVE_AS_CLWB)
++# define CLWB_ENCODING "clwb %[p]"
++#elif defined(HAVE_AS_XSAVEOPT)
++# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
++#else
++# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
++#endif
++
++#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
++#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
++# define INPUT BASE_INPUT
++#else
++# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
++#endif
++
++ asm volatile (CLWB_ENCODING :: INPUT(addr));
++
++#undef INPUT
++#undef BASE_INPUT
++#undef CLWB_ENCODING
++ }
++
++ asm volatile ("sfence" ::: "memory");
++}
++
+ unsigned int guest_flush_tlb_flags(const struct domain *d)
+ {
+ bool shadow = paging_mode_shadow(d);
+diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
+index 47b019c75017..77bba9806937 100644
+--- xen/common/grant_table.c.orig
++++ xen/common/grant_table.c
+@@ -3423,7 +3423,7 @@ gnttab_swap_grant_ref(XEN_GUEST_HANDLE_PARAM(gnttab_swap_grant_ref_t) uop,
+ return 0;
+ }
+
+-static int cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref)
++static int _cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref)
+ {
+ struct domain *d, *owner;
+ struct page_info *page;
+@@ -3517,7 +3517,7 @@ gnttab_cache_flush(XEN_GUEST_HANDLE_PARAM(gnttab_cache_flush_t) uop,
+ return -EFAULT;
+ for ( ; ; )
+ {
+- int ret = cache_flush(&op, cur_ref);
++ int ret = _cache_flush(&op, cur_ref);
+
+ if ( ret < 0 )
+ return ret;
+diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
+index cf4d2218fa8b..8f70ae727b86 100644
+--- xen/drivers/passthrough/vtd/extern.h.orig
++++ xen/drivers/passthrough/vtd/extern.h
+@@ -76,7 +76,6 @@ int __must_check qinval_device_iotlb_sync(struct vtd_iommu *iommu,
+ struct pci_dev *pdev,
+ u16 did, u16 size, u64 addr);
+
+-unsigned int get_cache_line_size(void);
+ void flush_all_cache(void);
+
+ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
+diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
+index a063462cff5a..68a658930a6a 100644
+--- xen/drivers/passthrough/vtd/iommu.c.orig
++++ xen/drivers/passthrough/vtd/iommu.c
+@@ -31,6 +31,7 @@
+ #include <xen/pci.h>
+ #include <xen/pci_regs.h>
+ #include <xen/keyhandler.h>
++#include <asm/cache.h>
+ #include <asm/msi.h>
+ #include <asm/nops.h>
+ #include <asm/irq.h>
+@@ -204,54 +205,6 @@ static void check_cleanup_domid_map(struct domain *d,
+ }
+ }
+
+-static void sync_cache(const void *addr, unsigned int size)
+-{
+- static unsigned long clflush_size = 0;
+- const void *end = addr + size;
+-
+- if ( clflush_size == 0 )
+- clflush_size = get_cache_line_size();
+-
+- addr -= (unsigned long)addr & (clflush_size - 1);
+- for ( ; addr < end; addr += clflush_size )
+-/*
+- * The arguments to a macro must not include preprocessor directives. Doing so
+- * results in undefined behavior, so we have to create some defines here in
+- * order to avoid it.
+- */
+-#if defined(HAVE_AS_CLWB)
+-# define CLWB_ENCODING "clwb %[p]"
+-#elif defined(HAVE_AS_XSAVEOPT)
+-# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */
+-#else
+-# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */
+-#endif
+-
+-#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr))
+-#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT)
+-# define INPUT BASE_INPUT
+-#else
+-# define INPUT(addr) "a" (addr), BASE_INPUT(addr)
+-#endif
+- /*
+- * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush
+- * + prefix than a clflush + nop, and hence the prefix is added instead
+- * of letting the alternative framework fill the gap by appending nops.
+- */
+- alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]",
+- "data16 clflush %[p]", /* clflushopt */
+- X86_FEATURE_CLFLUSHOPT,
+- CLWB_ENCODING,
+- X86_FEATURE_CLWB, /* no outputs */,
+- INPUT(addr));
+-#undef INPUT
+-#undef BASE_INPUT
+-#undef CLWB_ENCODING
+-
+- alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT,
+- "sfence", X86_FEATURE_CLWB);
+-}
+-
+ /* Allocate page table, return its machine address */
+ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
+ {
+@@ -271,7 +224,7 @@ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
+ clear_page(vaddr);
+
+ if ( (iommu_ops.init ? &iommu_ops : &vtd_ops)->sync_cache )
+- sync_cache(vaddr, PAGE_SIZE);
++ cache_writeback(vaddr, PAGE_SIZE);
+ unmap_domain_page(vaddr);
+ cur_pg++;
+ }
+@@ -1252,7 +1252,7 @@
+ iommu->nr_pt_levels = agaw_to_level(agaw);
+
+ if ( !ecap_coherent(iommu->ecap) )
+- vtd_ops.sync_cache = sync_cache;
++ vtd_ops.sync_cache = cache_writeback;
+
+ /* allocate domain id bitmap */
+ nr_dom = cap_ndoms(iommu->cap);
+diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c
+index 6681dccd6970..55f0faa521cb 100644
+--- xen/drivers/passthrough/vtd/x86/vtd.c.orig
++++ xen/drivers/passthrough/vtd/x86/vtd.c
+@@ -47,11 +47,6 @@ void unmap_vtd_domain_page(const void *va)
+ unmap_domain_page(va);
+ }
+
+-unsigned int get_cache_line_size(void)
+-{
+- return ((cpuid_ebx(1) >> 8) & 0xff) * 8;
+-}
+-
+ void flush_all_cache()
+ {
+ wbinvd();
+diff --git a/xen/include/asm-x86/cache.h b/xen/include/asm-x86/cache.h
+index 1f7173d8c72c..e4770efb22b9 100644
+--- xen/include/asm-x86/cache.h.orig
++++ xen/include/asm-x86/cache.h
+@@ -11,4 +11,11 @@
+
+ #define __read_mostly __section(".data.read_mostly")
+
++#ifndef __ASSEMBLY__
++
++void cache_flush(const void *addr, unsigned int size);
++void cache_writeback(const void *addr, unsigned int size);
++
++#endif
++
+ #endif
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/amd: Work around CLFLUSH ordering on older parts
+
+On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
+including reads and writes to the address, and LFENCE/SFENCE instructions.
+
+This creates a multitude of problematic corner cases, laid out in the manual.
+Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
+
+This is part of XSA-402.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
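A hedged sketch of the resulting shape on such parts, again with intrinsics
and an assumed 64-byte line size (illustrative only; Xen gates this on the
new X86_BUG_CLFLUSH_MFENCE synthetic feature via alternatives):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void flush_range_legacy_amd(const void *addr, size_t size)
{
    const size_t line = 64;
    uintptr_t p = (uintptr_t)addr & ~(line - 1);
    uintptr_t end = (uintptr_t)addr + size;

    _mm_mfence();                    /* order against older accesses */

    for ( ; p < end; p += line )
        _mm_clflush((void *)p);

    _mm_mfence();                    /* order against younger accesses */
}
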
+diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
+index 1ee687d0d224..986672a072b7 100644
+--- xen/arch/x86/cpu/amd.c.orig
++++ xen/arch/x86/cpu/amd.c
+@@ -787,6 +787,14 @@ static void init_amd(struct cpuinfo_x86 *c)
+ if (!cpu_has_lfence_dispatch)
+ __set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability);
+
++ /*
++ * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with
++ * everything, including reads and writes to address, and
++ * LFENCE/SFENCE instructions.
++ */
++ if (!cpu_has_clflushopt)
++ setup_force_cpu_cap(X86_BUG_CLFLUSH_MFENCE);
++
+ switch(c->x86)
+ {
+ case 0xf ... 0x11:
+diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
+index 0c912b8669f8..dcbb4064012e 100644
+--- xen/arch/x86/flushtlb.c.orig
++++ xen/arch/x86/flushtlb.c
+@@ -259,6 +259,13 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
+ return flags;
+ }
+
++/*
++ * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
++ * including reads and writes to address, and LFENCE/SFENCE instructions.
++ *
++ * This function only works safely after alternatives have run. Luckily, at
++ * the time of writing, we don't flush the caches that early.
++ */
+ void cache_flush(const void *addr, unsigned int size)
+ {
+ /*
+@@ -268,6 +275,8 @@ void cache_flush(const void *addr, unsigned int size)
+ unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16;
+ const void *end = addr + size;
+
++ alternative("", "mfence", X86_BUG_CLFLUSH_MFENCE);
++
+ addr -= (unsigned long)addr & (clflush_size - 1);
+ for ( ; addr < end; addr += clflush_size )
+ {
+@@ -283,7 +292,9 @@ void cache_flush(const void *addr, unsigned int size)
+ [p] "m" (*(const char *)(addr)));
+ }
+
+- alternative("", "sfence", X86_FEATURE_CLFLUSHOPT);
++ alternative_2("",
++ "sfence", X86_FEATURE_CLFLUSHOPT,
++ "mfence", X86_BUG_CLFLUSH_MFENCE);
+ }
+
+ void cache_writeback(const void *addr, unsigned int size)
+diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
+index fe2f97354fb6..09f619459bc7 100644
+--- xen/include/asm-x86/cpufeatures.h.orig
++++ xen/include/asm-x86/cpufeatures.h
+@@ -46,6 +46,7 @@ XEN_CPUFEATURE(XEN_IBT, X86_SYNTH(27)) /* Xen uses CET Indirect Branch
+ #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x))
+
+ #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */
++#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
+
+ /* Total number of capability words, inc synth and bug words. */
+ #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/pv: Track and flush non-coherent mappings of RAM
+
+There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
+devices that make non-coherent writes. The Linux sound subsystem makes
+extensive use of this technique.
+
+For such usecases, the guest's DMA buffer is mapped and consistently used as
+WC, and Xen doesn't interact with the buffer.
+
+However, a mischievous guest can use WC mappings to deliberately create
+non-coherency between the cache and RAM, and use this to trick Xen into
+validating a pagetable which isn't actually safe.
+
+Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set
+it whenever a non-coherent writeable mapping is created. If the page is used
+as anything other than PGT_writable_page, force a cache flush before
+validation. Also force a cache flush before the page is returned to the heap.
+
+This is CVE-2022-26364, part of XSA-402.
+
+Reported-by: Jann Horn <jannh%google.com@localhost>
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: George Dunlap <george.dunlap%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+
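The mechanism is a sticky per-page flag: record that a writeable non-WB
mapping existed, and flush the page's cache lines before its contents are
trusted or the page is reused. A hedged sketch of that pattern with
illustrative names (the real code keeps the flag, PGT_non_coherent, in
type_info and flushes with cache_flush()):

#include <stdatomic.h>

struct page {                    /* illustrative, not struct page_info */
    atomic_uint flags;
};

#define PG_NON_COHERENT 0x1u

extern void flush_page_cache(struct page *pg);   /* stand-in for cache_flush() */

/* Called when a writeable mapping with a non-WB memory type is created. */
static void note_non_coherent_map(struct page *pg)
{
    atomic_fetch_or(&pg->flags, PG_NON_COHERENT);
}

/* Called before the page's contents are validated or the page is freed. */
static void make_coherent(struct page *pg)
{
    if ( atomic_fetch_and(&pg->flags, ~PG_NON_COHERENT) & PG_NON_COHERENT )
        flush_page_cache(pg);
}
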
+diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
+index 6ce8c19dcecc..1759b84ba97c 100644
+--- xen/arch/x86/mm.c.orig
++++ xen/arch/x86/mm.c
+@@ -997,6 +997,15 @@ get_page_from_l1e(
+ return -EACCES;
+ }
+
++ /*
++ * Track writeable non-coherent mappings to RAM pages, to trigger a cache
++ * flush later if the target is used as anything but a PGT_writeable page.
++ * We care about all writeable mappings, including foreign mappings.
++ */
++ if ( !boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) &&
++ (l1f & (PAGE_CACHE_ATTRS | _PAGE_RW)) == (_PAGE_WC | _PAGE_RW) )
++ set_bit(_PGT_non_coherent, &page->u.inuse.type_info);
++
+ return 0;
+
+ could_not_pin:
+@@ -2442,6 +2451,19 @@ static int cleanup_page_mappings(struct page_info *page)
+ }
+ }
+
++ /*
++ * Flush the cache if there were previously non-coherent writeable
++ * mappings of this page. This forces the page to be coherent before it
++ * is freed back to the heap.
++ */
++ if ( __test_and_clear_bit(_PGT_non_coherent, &page->u.inuse.type_info) )
++ {
++ void *addr = __map_domain_page(page);
++
++ cache_flush(addr, PAGE_SIZE);
++ unmap_domain_page(addr);
++ }
++
+ return rc;
+ }
+
+@@ -3016,6 +3038,22 @@ static int _get_page_type(struct page_info *page, unsigned long type,
+ if ( unlikely(!(nx & PGT_validated)) )
+ {
+ /*
++ * Flush the cache if there were previously non-coherent mappings of
++ * this page, and we're trying to use it as anything other than a
++ * writeable page. This forces the page to be coherent before we
++ * validate its contents for safety.
++ */
++ if ( (nx & PGT_non_coherent) && type != PGT_writable_page )
++ {
++ void *addr = __map_domain_page(page);
++
++ cache_flush(addr, PAGE_SIZE);
++ unmap_domain_page(addr);
++
++ page->u.inuse.type_info &= ~PGT_non_coherent;
++ }
++
++ /*
+ * No special validation needed for writable or shared pages. Page
+ * tables and GDT/LDT need to have their contents audited.
+ *
+diff --git a/xen/arch/x86/pv/grant_table.c b/xen/arch/x86/pv/grant_table.c
+index 0325618c9883..81c72e61ed55 100644
+--- xen/arch/x86/pv/grant_table.c.orig
++++ xen/arch/x86/pv/grant_table.c
+@@ -109,7 +109,17 @@ int create_grant_pv_mapping(uint64_t addr, mfn_t frame,
+
+ ol1e = *pl1e;
+ if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) )
++ {
++ /*
++ * We always create mappings in this path. However, our caller,
++ * map_grant_ref(), only passes potentially non-zero cache_flags for
++ * MMIO frames, so this path doesn't create non-coherent mappings of
++ * RAM frames and there's no need to calculate PGT_non_coherent.
++ */
++ ASSERT(!cache_flags || is_iomem_page(frame));
++
+ rc = GNTST_okay;
++ }
+
+ out_unlock:
+ page_unlock(page);
+@@ -294,7 +304,18 @@ int replace_grant_pv_mapping(uint64_t addr, mfn_t frame,
+ l1e_get_flags(ol1e), addr, grant_pte_flags);
+
+ if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, curr, 0) )
++ {
++ /*
++ * Generally, replace_grant_pv_mapping() is used to destroy mappings
++ * (n1le = l1e_empty()), but it can be a present mapping on the
++ * GNTABOP_unmap_and_replace path.
++ *
++ * In such cases, the PTE is fully transplanted from its old location
++ * via steal_linear_addr(), so we need not perform PGT_non_coherent
++ * checking here.
++ */
+ rc = GNTST_okay;
++ }
+
+ out_unlock:
+ page_unlock(page);
+diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
+index f5b8862b8374..5c19b71eca70 100644
+--- xen/include/asm-x86/mm.h.orig
++++ xen/include/asm-x86/mm.h
+@@ -53,8 +53,12 @@
+ #define _PGT_partial PG_shift(8)
+ #define PGT_partial PG_mask(1, 8)
+
++/* Has this page been mapped writeable with a non-coherent memory type? */
++#define _PGT_non_coherent PG_shift(9)
++#define PGT_non_coherent PG_mask(1, 9)
++
+ /* Count of uses of this frame as its current type. */
+-#define PGT_count_width PG_shift(8)
++#define PGT_count_width PG_shift(9)
+ #define PGT_count_mask ((1UL<<PGT_count_width)-1)
+
+ /* Are the 'type mask' bits identical? */
Index: pkgsrc/sysutils/xenkernel415/patches/patch-XSA404
diff -u /dev/null pkgsrc/sysutils/xenkernel415/patches/patch-XSA404:1.1
--- /dev/null Fri Jun 24 13:07:52 2022
+++ pkgsrc/sysutils/xenkernel415/patches/patch-XSA404 Fri Jun 24 13:07:52 2022
@@ -0,0 +1,499 @@
+$NetBSD: patch-XSA404,v 1.1 2022/06/24 13:07:52 bouyer Exp $
+
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/spec-ctrl: Make VERW flushing runtime conditional
+
+Currently, VERW flushing to mitigate MDS is boot time conditional per domain
+type. However, to provide mitigations for DRPW (CVE-2022-21166), we need to
+conditionally use VERW based on the trustworthiness of the guest, and the
+devices passed through.
+
+Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
+path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.
+
+Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
+disposition at domain creation time, and context switch the SCF_verw bit.
+
+For now, VERW flushing is used and controlled exactly as before, but later
+patches will add per-domain cases too.
+
+No change in behaviour.
+
+This is part of XSA-404.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Jan Beulich <jbeulich%suse.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
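In outline: the boot-time choice becomes a per-domain boolean, copied into the
per-CPU spec_ctrl_flags byte on context switch, and the exit-to-guest path
tests that byte and issues VERW only when the bit is set. A hedged C rendering
of the final test (the real code is the DO_SPEC_CTRL_COND_VERW assembly macro
so it can run on the return-to-guest path; the bit value mirrors SCF_verw):

#include <stdint.h>

#define SCF_verw (1u << 3)

static inline void cond_verw(uint8_t spec_ctrl_flags, uint16_t verw_sel)
{
    /* VERW is issued purely for its flushing side effect; the selector
     * operand is the one the patch keeps at CPUINFO_verw_sel. */
    if ( spec_ctrl_flags & SCF_verw )
        asm volatile ( "verw %0" :: "m" (verw_sel) : "cc" );
}
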
+diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
+index 1cab26fef61f..e4c820e17053 100644
+--- docs/misc/xen-command-line.pandoc.orig
++++ docs/misc/xen-command-line.pandoc
+@@ -2194,9 +2194,8 @@ in place for guests to use.
+ Use of a positive boolean value for either of these options is invalid.
+
+ The booleans `pv=`, `hvm=`, `msr-sc=`, `rsb=` and `md-clear=` offer fine
+-grained control over the alternative blocks used by Xen. These impact Xen's
+-ability to protect itself, and Xen's ability to virtualise support for guests
+-to use.
++grained control over the primitives used by Xen. These impact Xen's ability to
++protect itself, and Xen's ability to virtualise support for guests to use.
+
+ * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
+ respectively.
+diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
+index b21272988006..4a61e951facf 100644
+--- xen/arch/x86/domain.c.orig
++++ xen/arch/x86/domain.c
+@@ -861,6 +861,8 @@ int arch_domain_create(struct domain *d,
+
+ d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
+
++ spec_ctrl_init_domain(d);
++
+ return 0;
+
+ fail:
+@@ -1994,14 +1996,15 @@ static void __context_switch(void)
+ void context_switch(struct vcpu *prev, struct vcpu *next)
+ {
+ unsigned int cpu = smp_processor_id();
++ struct cpu_info *info = get_cpu_info();
+ const struct domain *prevd = prev->domain, *nextd = next->domain;
+ unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
+
+ ASSERT(prev != next);
+ ASSERT(local_irq_is_enabled());
+
+- get_cpu_info()->use_pv_cr3 = false;
+- get_cpu_info()->xen_cr3 = 0;
++ info->use_pv_cr3 = false;
++ info->xen_cr3 = 0;
+
+ if ( unlikely(dirty_cpu != cpu) && dirty_cpu != VCPU_CPU_CLEAN )
+ {
+@@ -2065,6 +2068,11 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
+ *last_id = next_id;
+ }
+ }
++
++ /* Update the top-of-stack block with the VERW disposition. */
++ info->spec_ctrl_flags &= ~SCF_verw;
++ if ( nextd->arch.verw )
++ info->spec_ctrl_flags |= SCF_verw;
+ }
+
+ sched_context_switched(prev, next);
+diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
+index 49651f3c435a..5f5de45a1309 100644
+--- xen/arch/x86/hvm/vmx/entry.S.orig
++++ xen/arch/x86/hvm/vmx/entry.S
+@@ -87,7 +87,7 @@ UNLIKELY_END(realmode)
+
+ /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+ /* SPEC_CTRL_EXIT_TO_VMX Req: %rsp=regs/cpuinfo Clob: */
+- ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM
++ DO_SPEC_CTRL_COND_VERW
+
+ mov VCPU_hvm_guest_cr2(%rbx),%rax
+
+diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
+index 1e226102d399..b4efc940aa2b 100644
+--- xen/arch/x86/spec_ctrl.c.orig
++++ xen/arch/x86/spec_ctrl.c
+@@ -36,8 +36,8 @@ static bool __initdata opt_msr_sc_pv = true;
+ static bool __initdata opt_msr_sc_hvm = true;
+ static bool __initdata opt_rsb_pv = true;
+ static bool __initdata opt_rsb_hvm = true;
+-static int8_t __initdata opt_md_clear_pv = -1;
+-static int8_t __initdata opt_md_clear_hvm = -1;
++static int8_t __read_mostly opt_md_clear_pv = -1;
++static int8_t __read_mostly opt_md_clear_hvm = -1;
+
+ /* Cmdline controls for Xen's speculative settings. */
+ static enum ind_thunk {
+@@ -903,6 +903,13 @@ static __init void mds_calculations(uint64_t caps)
+ }
+ }
+
++void spec_ctrl_init_domain(struct domain *d)
++{
++ bool pv = is_pv_domain(d);
++
++ d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
++}
++
+ void __init init_speculation_mitigations(void)
+ {
+ enum ind_thunk thunk = THUNK_DEFAULT;
+@@ -1148,21 +1155,20 @@ void __init init_speculation_mitigations(void)
+ boot_cpu_has(X86_FEATURE_MD_CLEAR));
+
+ /*
+- * Enable MDS defences as applicable. The PV blocks need using all the
+- * time, and the Idle blocks need using if either PV or HVM defences are
+- * used.
++ * Enable MDS defences as applicable. The Idle blocks need using if
++ * either PV or HVM defences are used.
+ *
+ * HVM is more complicated. The MD_CLEAR microcode extends L1D_FLUSH with
+- * equivelent semantics to avoid needing to perform both flushes on the
+- * HVM path. The HVM blocks don't need activating if our hypervisor told
+- * us it was handling L1D_FLUSH, or we are using L1D_FLUSH ourselves.
++ * equivalent semantics to avoid needing to perform both flushes on the
++ * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH.
++ *
++ * After calculating the appropriate idle setting, simplify
++ * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
++ * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+ */
+- if ( opt_md_clear_pv )
+- setup_force_cpu_cap(X86_FEATURE_SC_VERW_PV);
+ if ( opt_md_clear_pv || opt_md_clear_hvm )
+ setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
+- if ( opt_md_clear_hvm && !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush )
+- setup_force_cpu_cap(X86_FEATURE_SC_VERW_HVM);
++ opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
+
+ /*
+ * Warn the user if they are on MLPDS/MFBDS-vulnerable hardware with HT
+diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
+index 09f619459bc7..9eaab7a2a1fa 100644
+--- xen/include/asm-x86/cpufeatures.h.orig 2022-06-23 19:50:27.080499703 +0200
++++ xen/include/asm-x86/cpufeatures.h 2022-06-23 19:51:20.975755594 +0200
+@@ -35,8 +35,7 @@
+ XEN_CPUFEATURE(XEN_SELFSNOOP, X86_SYNTH(20)) /* SELFSNOOP gets used by Xen itself */
+ XEN_CPUFEATURE(SC_MSR_IDLE, X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */
+ XEN_CPUFEATURE(XEN_LBR, X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */
+-XEN_CPUFEATURE(SC_VERW_PV, X86_SYNTH(23)) /* VERW used by Xen for PV */
+-XEN_CPUFEATURE(SC_VERW_HVM, X86_SYNTH(24)) /* VERW used by Xen for HVM */
++/* Bits 23,24 unused. */
+ XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */
+ XEN_CPUFEATURE(XEN_SHSTK, X86_SYNTH(26)) /* Xen uses CET Shadow Stacks */
+
+diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
+index 7213d184b016..d0df7f83aa0c 100644
+--- xen/include/asm-x86/domain.h.orig
++++ xen/include/asm-x86/domain.h
+@@ -319,6 +319,9 @@ struct arch_domain
+ uint32_t pci_cf8;
+ uint8_t cmos_idx;
+
++ /* Use VERW on return-to-guest for its flushing side effect. */
++ bool verw;
++
+ union {
+ struct pv_domain pv;
+ struct hvm_domain hvm;
+diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
+index 9caecddfec96..68f6c46c470c 100644
+--- xen/include/asm-x86/spec_ctrl.h.orig
++++ xen/include/asm-x86/spec_ctrl.h
+@@ -24,6 +24,7 @@
+ #define SCF_use_shadow (1 << 0)
+ #define SCF_ist_wrmsr (1 << 1)
+ #define SCF_ist_rsb (1 << 2)
++#define SCF_verw (1 << 3)
+
+ #ifndef __ASSEMBLY__
+
+@@ -32,6 +33,7 @@
+ #include <asm/msr-index.h>
+
+ void init_speculation_mitigations(void);
++void spec_ctrl_init_domain(struct domain *d);
+
+ extern bool opt_ibpb;
+ extern bool opt_ssbd;
+diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
+index 02b3b18ce69f..5a590bac44aa 100644
+--- xen/include/asm-x86/spec_ctrl_asm.h.orig
++++ xen/include/asm-x86/spec_ctrl_asm.h
+@@ -136,6 +136,19 @@
+ #endif
+ .endm
+
++.macro DO_SPEC_CTRL_COND_VERW
++/*
++ * Requires %rsp=cpuinfo
++ *
++ * Issue a VERW for its flushing side effect, if indicated. This is a Spectre
++ * v1 gadget, but the IRET/VMEntry is serialising.
++ */
++ testb $SCF_verw, CPUINFO_spec_ctrl_flags(%rsp)
++ jz .L\@_verw_skip
++ verw CPUINFO_verw_sel(%rsp)
++.L\@_verw_skip:
++.endm
++
+ .macro DO_SPEC_CTRL_ENTRY maybexen:req
+ /*
+ * Requires %rsp=regs (also cpuinfo if !maybexen)
+@@ -231,8 +244,7 @@
+ #define SPEC_CTRL_EXIT_TO_PV \
+ ALTERNATIVE "", \
+ DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_PV; \
+- ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), \
+- X86_FEATURE_SC_VERW_PV
++ DO_SPEC_CTRL_COND_VERW
+
+ /*
+ * Use in IST interrupt/exception context. May interrupt Xen or PV context.
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/spec-ctrl: Enumeration for MMIO Stale Data controls
+
+The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
+data movement primitives.
+
+FB_CLEAR indicates that the VERW instruction has re-gained its Fill Buffer
+flushing side effect. This is only enumerated on parts where VERW had
+previously lost its flushing side effect due to the MDS/TAA vulnerabilities
+being fixed in hardware.
+
+FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
+clearing side effect of VERW can be turned off for performance reasons.
+
+This is part of XSA-404.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+
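All of the new bits live in IA32_ARCH_CAPABILITIES (MSR 0x10a); the bit
positions below are taken from the msr-index.h hunk in this patch, while the
rdmsr wrapper is an illustrative ring-0 sketch rather than Xen's helper (it
also assumes the MSR is enumerated via CPUID.7.0:EDX bit 29):

#include <stdint.h>
#include <stdio.h>

#define MSR_ARCH_CAPABILITIES   0x0000010a
#define ARCH_CAPS_SBDR_SSDP_NO  (1ull << 13)
#define ARCH_CAPS_FBSDP_NO      (1ull << 14)
#define ARCH_CAPS_PSDP_NO       (1ull << 15)
#define ARCH_CAPS_FB_CLEAR      (1ull << 17)
#define ARCH_CAPS_FB_CLEAR_CTRL (1ull << 18)

static inline uint64_t rdmsr(uint32_t idx)
{
    uint32_t lo, hi;

    asm volatile ( "rdmsr" : "=a" (lo), "=d" (hi) : "c" (idx) );
    return ((uint64_t)hi << 32) | lo;
}

static void report_mmio_stale_data_caps(void)
{
    uint64_t caps = rdmsr(MSR_ARCH_CAPABILITIES);

    printf("SBDR_SSDP_NO:%d FBSDP_NO:%d PSDP_NO:%d FB_CLEAR:%d FB_CLEAR_CTRL:%d\n",
           !!(caps & ARCH_CAPS_SBDR_SSDP_NO), !!(caps & ARCH_CAPS_FBSDP_NO),
           !!(caps & ARCH_CAPS_PSDP_NO), !!(caps & ARCH_CAPS_FB_CLEAR),
           !!(caps & ARCH_CAPS_FB_CLEAR_CTRL));
}
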
+diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
+index b4efc940aa2b..38e0cc2847e0 100644
+--- xen/arch/x86/spec_ctrl.c.orig
++++ xen/arch/x86/spec_ctrl.c
+@@ -323,7 +323,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
+ * Hardware read-only information, stating immunity to certain issues, or
+ * suggestions of which mitigation to use.
+ */
+- printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s\n",
++ printk(" Hardware hints:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ (caps & ARCH_CAPS_RDCL_NO) ? " RDCL_NO" : "",
+ (caps & ARCH_CAPS_IBRS_ALL) ? " IBRS_ALL" : "",
+ (caps & ARCH_CAPS_RSBA) ? " RSBA" : "",
+@@ -332,13 +332,16 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
+ (caps & ARCH_CAPS_SSB_NO) ? " SSB_NO" : "",
+ (caps & ARCH_CAPS_MDS_NO) ? " MDS_NO" : "",
+ (caps & ARCH_CAPS_TAA_NO) ? " TAA_NO" : "",
++ (caps & ARCH_CAPS_SBDR_SSDP_NO) ? " SBDR_SSDP_NO" : "",
++ (caps & ARCH_CAPS_FBSDP_NO) ? " FBSDP_NO" : "",
++ (caps & ARCH_CAPS_PSDP_NO) ? " PSDP_NO" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_IBRS_ALWAYS)) ? " IBRS_ALWAYS" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_STIBP_ALWAYS)) ? " STIBP_ALWAYS" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_IBRS_FAST)) ? " IBRS_FAST" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_IBRS_SAME_MODE)) ? " IBRS_SAME_MODE" : "");
+
+ /* Hardware features which need driving to mitigate issues. */
+- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s\n",
++ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ (e8b & cpufeat_mask(X86_FEATURE_IBPB)) ||
+ (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBPB" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_IBRS)) ||
+@@ -353,7 +356,9 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
+ (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR)) ? " MD_CLEAR" : "",
+ (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL)) ? " SRBDS_CTRL" : "",
+ (e8b & cpufeat_mask(X86_FEATURE_VIRT_SSBD)) ? " VIRT_SSBD" : "",
+- (caps & ARCH_CAPS_TSX_CTRL) ? " TSX_CTRL" : "");
++ (caps & ARCH_CAPS_TSX_CTRL) ? " TSX_CTRL" : "",
++ (caps & ARCH_CAPS_FB_CLEAR) ? " FB_CLEAR" : "",
++ (caps & ARCH_CAPS_FB_CLEAR_CTRL) ? " FB_CLEAR_CTRL" : "");
+
+ /* Compiled-in support which pertains to mitigations. */
+ if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) || IS_ENABLED(CONFIG_SHADOW_PAGING) )
+diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
+index 947778105fb6..1e743461e91d 100644
+--- xen/include/asm-x86/msr-index.h.orig
++++ xen/include/asm-x86/msr-index.h
+@@ -59,6 +59,11 @@
+ #define ARCH_CAPS_IF_PSCHANGE_MC_NO (_AC(1, ULL) << 6)
+ #define ARCH_CAPS_TSX_CTRL (_AC(1, ULL) << 7)
+ #define ARCH_CAPS_TAA_NO (_AC(1, ULL) << 8)
++#define ARCH_CAPS_SBDR_SSDP_NO (_AC(1, ULL) << 13)
++#define ARCH_CAPS_FBSDP_NO (_AC(1, ULL) << 14)
++#define ARCH_CAPS_PSDP_NO (_AC(1, ULL) << 15)
++#define ARCH_CAPS_FB_CLEAR (_AC(1, ULL) << 17)
++#define ARCH_CAPS_FB_CLEAR_CTRL (_AC(1, ULL) << 18)
+
+ #define MSR_FLUSH_CMD 0x0000010b
+ #define FLUSH_CMD_L1D (_AC(1, ULL) << 0)
+@@ -76,4 +81,5 @@
+ #define MCU_OPT_CTRL_RNGDS_MITG_DIS (_AC(1, ULL) << 0)
++#define MCU_OPT_CTRL_FB_CLEAR_DIS (_AC(1, ULL) << 3)
+
+ #define MSR_RTIT_OUTPUT_BASE 0x00000560
+ #define MSR_RTIT_OUTPUT_MASK 0x00000561
+From: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Subject: x86/spec-ctrl: Add spec-ctrl=unpriv-mmio
+
+Per Xen's support statement, PCI passthrough should be to trusted domains
+because the overall system security depends on factors outside of Xen's
+control.
+
+As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.
+
+However, users who have risk assessed their configuration may be happy with
+the risk of DoS, but unhappy with the risk of cross-domain data leakage. Such
+users should enable this option.
+
+On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
+mitigate MMIO cross-domain data leakage.
+
+On CPUs fixed against MDS but vulnerable to MMIO stale data leakage, this option:
+
+ * On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
+ using FB_CLEAR.
+ * On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
+ srb-lock, previously used to mitigate SRBDS.
+
+Both mitigations require microcode from IPU 2022.1, May 2022.
+
+This is part of XSA-404.
+
+Signed-off-by: Andrew Cooper <andrew.cooper3%citrix.com@localhost>
+Reviewed-by: Roger Pau Monné <roger.pau%citrix.com@localhost>
+---
+Backporting note: For Xen 4.7 and earlier with bool_t not aliasing bool, the
+ARCH_CAPS_FB_CLEAR hunk needs !!
+
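As a usage note, the new control is just another `spec-ctrl=` list item on
the Xen boot line; a hypothetical GRUB entry (the path and the absence of
other options are illustrative) would look like:

multiboot2 /boot/xen.gz spec-ctrl=unpriv-mmio

With the May 2022 microcode in place, this is intended to make Xen use
FB_CLEAR (VERW on the return-to-guest path for domains with passthrough
enabled) and keep srb-lock engaged by default, per the hunks below.
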
+diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
+index e4c820e17053..e17a835ed254 100644
+--- docs/misc/xen-command-line.pandoc.orig
++++ docs/misc/xen-command-line.pandoc
+@@ -2171,7 +2171,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
+ ### spec-ctrl (x86)
+ > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
+ > bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
+-> l1d-flush,branch-harden,srb-lock}=<bool> ]`
++> l1d-flush,branch-harden,srb-lock,unpriv-mmio}=<bool> ]`
+
+ Controls for speculative execution sidechannel mitigations. By default, Xen
+ will pick the most appropriate mitigations based on compiled in support,
+@@ -2250,8 +2250,16 @@ Xen will enable this mitigation.
+ On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force
+ or prevent Xen from protect the Special Register Buffer from leaking stale
+ data. By default, Xen will enable this mitigation, except on parts where MDS
+-is fixed and TAA is fixed/mitigated (in which case, there is believed to be no
+-way for an attacker to obtain the stale data).
++is fixed and TAA is fixed/mitigated and there are no unprivileged MMIO
++mappings (in which case, there is believed to be no way for an attacker to
++obtain stale data).
++
++The `unpriv-mmio=` boolean indicates whether the system has (or will have)
++less than fully privileged domains granted access to MMIO devices. By
++default, this option is disabled. If enabled, Xen will use the `FB_CLEAR`
++and/or `SRBDS_CTRL` functionality available in the Intel May 2022 microcode
++release to mitigate cross-domain leakage of data via the MMIO Stale Data
++vulnerabilities.
+
+ ### sync_console
+ > `= <boolean>`
+diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
+index 38e0cc2847e0..83b856fa9158 100644
+--- xen/arch/x86/spec_ctrl.c.orig
++++ xen/arch/x86/spec_ctrl.c
+@@ -67,7 +67,9 @@ static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */
+ static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */
+
+ static int8_t __initdata opt_srb_lock = -1;
+ uint64_t __read_mostly default_xen_mcu_opt_ctrl;
++static bool __initdata opt_unpriv_mmio;
++static bool __read_mostly opt_fb_clear_mmio;
+
+ static int __init parse_spec_ctrl(const char *s)
+ {
+@@ -184,6 +186,8 @@ static int __init parse_spec_ctrl(const char *s)
+ opt_branch_harden = val;
+ else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 )
+ opt_srb_lock = val;
++ else if ( (val = parse_boolean("unpriv-mmio", s, ss)) >= 0 )
++ opt_unpriv_mmio = val;
+ else
+ rc = -EINVAL;
+
+@@ -392,7 +396,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps)
+ opt_srb_lock ? " SRB_LOCK+" : " SRB_LOCK-",
+ opt_ibpb ? " IBPB" : "",
+ opt_l1d_flush ? " L1D_FLUSH" : "",
+- opt_md_clear_pv || opt_md_clear_hvm ? " VERW" : "",
++ opt_md_clear_pv || opt_md_clear_hvm ||
++ opt_fb_clear_mmio ? " VERW" : "",
+ opt_branch_harden ? " BRANCH_HARDEN" : "");
+
+ /* L1TF diagnostics, printed if vulnerable or PV shadowing is in use. */
+@@ -912,7 +917,9 @@ void spec_ctrl_init_domain(struct domain *d)
+ {
+ bool pv = is_pv_domain(d);
+
+- d->arch.verw = pv ? opt_md_clear_pv : opt_md_clear_hvm;
++ d->arch.verw =
++ (pv ? opt_md_clear_pv : opt_md_clear_hvm) ||
++ (opt_fb_clear_mmio && is_iommu_enabled(d));
+ }
+
+ void __init init_speculation_mitigations(void)
+@@ -1148,6 +1155,18 @@ void __init init_speculation_mitigations(void)
+ mds_calculations(caps);
+
+ /*
++ * Parts which enumerate FB_CLEAR are those which are post-MDS_NO and have
++ * reintroduced the VERW fill buffer flushing side effect because of a
++ * susceptibility to FBSDP.
++ *
++ * If unprivileged guests have (or will have) MMIO mappings, we can
++ * mitigate cross-domain leakage of fill buffer data by issuing VERW on
++ * the return-to-guest path.
++ */
++ if ( opt_unpriv_mmio )
++ opt_fb_clear_mmio = caps & ARCH_CAPS_FB_CLEAR;
++
++ /*
+ * By default, enable PV and HVM mitigations on MDS-vulnerable hardware.
+ * This will only be a token effort for MLPDS/MFBDS when HT is enabled,
+ * but it is somewhat better than nothing.
+@@ -1160,18 +1179,20 @@ void __init init_speculation_mitigations(void)
+ boot_cpu_has(X86_FEATURE_MD_CLEAR));
+
+ /*
+- * Enable MDS defences as applicable. The Idle blocks need using if
+- * either PV or HVM defences are used.
++ * Enable MDS/MMIO defences as applicable. The Idle blocks need using if
++ * either the PV or HVM MDS defences are used, or if we may give MMIO
++ * access to untrusted guests.
+ *
+ * HVM is more complicated. The MD_CLEAR microcode extends L1D_FLUSH with
+ * equivalent semantics to avoid needing to perform both flushes on the
+- * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH.
++ * HVM path. Therefore, we don't need VERW in addition to L1D_FLUSH (for
++ * MDS mitigations. L1D_FLUSH is not safe for MMIO mitigations.)
+ *
+ * After calculating the appropriate idle setting, simplify
+ * opt_md_clear_hvm to mean just "should we VERW on the way into HVM
+ * guests", so spec_ctrl_init_domain() can calculate suitable settings.
+ */
+- if ( opt_md_clear_pv || opt_md_clear_hvm )
++ if ( opt_md_clear_pv || opt_md_clear_hvm || opt_fb_clear_mmio )
+ setup_force_cpu_cap(X86_FEATURE_SC_VERW_IDLE);
+ opt_md_clear_hvm &= !(caps & ARCH_CAPS_SKIP_L1DFL) && !opt_l1d_flush;
+
+@@ -1213,12 +1234,16 @@
+ * On some SRBDS-affected hardware, it may be safe to relax srb-lock
+ * by default.
+ *
+- * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only way
+- * to access the Fill Buffer. If TSX isn't available (inc. SKU
+- * reasons on some models), or TSX is explicitly disabled, then there
+- * is no need for the extra overhead to protect RDRAND/RDSEED.
++ * data becomes available to other contexts. To recover the data, an
++ * attacker needs to use:
++ * - SBDS (MDS or TAA to sample the cores fill buffer)
++ * - SBDR (Architecturally retrieve stale transaction buffer contents)
++ * - DRPW (Architecturally latch stale fill buffer data)
++ *
++ * On MDS_NO parts, and with TAA_NO or TSX unavailable/disabled, and there
++ * is no unprivileged MMIO access, the RNG data doesn't need protecting.
+ */
+- if ( opt_srb_lock == -1 &&
++ if ( opt_srb_lock == -1 && !opt_unpriv_mmio &&
+ (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO &&
+ (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && rtm_disabled)) )
+ opt_srb_lock = 0;