Source-Changes-HG archive


[src/trunk]: src/sys/arch/x86 Add the EPT pmap code, used by Intel-VMX.



details:   https://anonhg.NetBSD.org/src/rev/d274a003b048
branches:  trunk
changeset: 996919:d274a003b048
user:      maxv <maxv%NetBSD.org@localhost>
date:      Wed Feb 13 08:38:25 2019 +0000

description:
Add the EPT pmap code, used by Intel-VMX.

Under NVMM, we don't want to implement the hypervisor page tables directly
in NVMM, because we want pageable guests; that is, we want to allow UVM to
unmap guest pages when the host comes under memory pressure.

Unlike AMD-SVM, Intel-VMX uses a set of PTE bits that differs from native,
and this has three important consequences:

 - We can't use the native PTE bits, so each time we want to modify the
   page tables, we need to know whether we're dealing with a native pmap
   or an EPT pmap. This is accomplished with callbacks that handle
   everything PTE-related.

 - There is no recursive slot possible, so we can't use pmap_map_ptes().
   Rather, we walk down the EPT trees via the direct map, and that's
   actually a lot simpler (and probably faster too...).

 - The kernel is never mapped in an EPT pmap. An EPT pmap cannot be loaded
   on the host. This has two sub-consequences: at creation time we must
   zero out all of the top-level PTEs, and at destruction time we force
   the page out of the pool cache and into the pool, to ensure that the
   next allocation will invoke pmap_pdp_ctor() to create a native pmap and
   not recycle some stale EPT entries.
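
The direct-map walk from the second point can be sketched in user space as
follows. This is only an illustration, not the committed code: physical
memory is simulated with a static array, sk_direct_map() stands in for
PMAP_DIRECT_MAP(), and every sk_/SK_ name, including the frame mask, is
an assumption made for the example. Only the EPT_R/EPT_W/EPT_X bit
positions are taken from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as defined in the diff; EPT_R also serves as "valid". */
#define EPT_R		(UINT64_C(1) << 0)
#define EPT_W		(UINT64_C(1) << 1)
#define EPT_X		(UINT64_C(1) << 2)

/* Everything below is hypothetical scaffolding for the sketch. */
#define SK_FRAME	UINT64_C(0x000ffffffffff000)	/* assumed frame mask */
#define SK_LEVELS	4
#define SK_NPAGES	8
#define SK_PAGE_SIZE	4096

/* Simulated physical memory: page N lives at physical address N * 4096. */
static uint64_t sk_phys[SK_NPAGES][512];

/* Stand-in for PMAP_DIRECT_MAP(): physical address -> usable pointer. */
static uint64_t *
sk_direct_map(uint64_t pa)
{
	assert(pa / SK_PAGE_SIZE < SK_NPAGES);
	return sk_phys[pa / SK_PAGE_SIZE];
}

/* Index into the table at 'level' (4 = top, 1 = leaf), 9 bits per level. */
static unsigned
sk_index(uint64_t gpa, int level)
{
	return (unsigned)((gpa >> (12 + 9 * (level - 1))) & 0x1ff);
}

/*
 * Walk from the top-level table down to the leaf entry for 'gpa'.
 * Returns NULL if an intermediate level is not present.
 */
static uint64_t *
sk_walk(uint64_t root_pa, uint64_t gpa)
{
	uint64_t pa = root_pa;

	for (int level = SK_LEVELS; level > 1; level--) {
		uint64_t *table = sk_direct_map(pa);
		uint64_t entry = table[sk_index(gpa, level)];

		if ((entry & EPT_R) == 0)
			return NULL;
		pa = entry & SK_FRAME;
	}
	return &sk_direct_map(pa)[sk_index(gpa, 1)];
}

/* Build a single path: page 0 = L4, 1 = L3, 2 = L2, 3 = L1 for 'gpa'. */
static void
sk_build(uint64_t gpa)
{
	for (int level = SK_LEVELS; level > 1; level--) {
		uint64_t *table = sk_phys[SK_LEVELS - level];
		uint64_t next_pa =
		    (uint64_t)(SK_LEVELS - level + 1) * SK_PAGE_SIZE;

		table[sk_index(gpa, level)] = next_pa | EPT_R | EPT_W | EPT_X;
	}
}
```

After sk_build(gpa), sk_walk(0, gpa) returns a pointer into the simulated
L1 page, while an address whose L4 slot was never populated yields NULL.
No recursive slot is needed at any point, since each level is reached
through its physical address.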

To create an EPT pmap, the caller must invoke pmap_ept_transform() on a
newly-allocated native pmap. And that's about it: from then on the EPT
callbacks will be invoked, and the pmap can be destroyed via the usual
pmap_destroy(). The TLB shootdown callback is not initialized, however;
it is the responsibility of the hypervisor (NVMM) to set it.

There are some twisted cases that we need to handle. For example, if
pmap_is_referenced() is called on a physical page that is entered in both
a native pmap and an EPT pmap, we take the Accessed bits from the two
pmaps, using a different PTE bit set in each case, and combine them into a
generic PP_ATTRS_U flag (that does not depend on the pmap type).
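
A minimal sketch of that combination, assuming the architectural native
bit positions (W = bit 1, A = bit 5, D = bit 6) and the EPT positions from
the diff. The SK_ATTRS_* values, struct sk_mapping and the sk_ functions
are hypothetical stand-ins for the PP_ATTRS_* machinery, not the real
kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Native x86 PTE bits (architectural positions). */
#define SK_NATIVE_W	(UINT64_C(1) << 1)
#define SK_NATIVE_A	(UINT64_C(1) << 5)
#define SK_NATIVE_D	(UINT64_C(1) << 6)

/* EPT bits, positions as defined in the diff. */
#define EPT_W		(UINT64_C(1) << 1)
#define EPT_A		(UINT64_C(1) << 8)
#define EPT_D		(UINT64_C(1) << 9)

/* Generic, pmap-independent attributes; the values are placeholders. */
#define SK_ATTRS_M	0x01	/* modified */
#define SK_ATTRS_U	0x02	/* used (referenced) */
#define SK_ATTRS_W	0x04	/* writable */

static uint8_t
sk_native_to_attrs(uint64_t pte)
{
	uint8_t ret = 0;

	if (pte & SK_NATIVE_D)
		ret |= SK_ATTRS_M;
	if (pte & SK_NATIVE_A)
		ret |= SK_ATTRS_U;
	if (pte & SK_NATIVE_W)
		ret |= SK_ATTRS_W;
	return ret;
}

static uint8_t
sk_ept_to_attrs(uint64_t pte)
{
	uint8_t ret = 0;

	if (pte & EPT_D)
		ret |= SK_ATTRS_M;
	if (pte & EPT_A)
		ret |= SK_ATTRS_U;
	if (pte & EPT_W)
		ret |= SK_ATTRS_W;
	return ret;
}

/* One mapping of a physical page: the raw entry plus the pmap flavor. */
struct sk_mapping {
	uint64_t pte;
	int is_ept;
};

/* Decode each mapping with its own bit set, then merge the results. */
static int
sk_is_referenced(const struct sk_mapping *maps, size_t n)
{
	uint8_t attrs = 0;

	for (size_t i = 0; i < n; i++) {
		attrs |= maps[i].is_ept ?
		    sk_ept_to_attrs(maps[i].pte) :
		    sk_native_to_attrs(maps[i].pte);
	}
	return (attrs & SK_ATTRS_U) != 0;
}
```

The per-type decoding matters because the bit positions overlap: bit 5 is
Accessed in a native PTE but part of the memory type field in an EPT
entry, so a write-back EPT entry decoded with the native bits would
wrongly look referenced.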

Given that the EPT layout is a 4-level tree with the same address space as
native x86_64, we allow ourselves to use a few native macros in EPT, such
as pmap_pa2pte(), rather than re-defining them with "ept" in the name.

Even though this EPT code is rather complex, it is not too intrusive: just
a few callbacks in a few pmap functions, predicted-false to give priority
to native. So this comes with no messy #ifdef or performance cost.
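
The shape of such a predicted-false callback can be sketched as below.
This is an assumption-laden illustration rather than the actual pmap.c
code: struct sk_pmap, its fields and the sk_ functions are hypothetical,
and sk_predict_false() is a local macro built on the GCC/Clang
__builtin_expect() builtin.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch hint; assumes a GCC- or Clang-compatible compiler. */
#define sk_predict_false(x)	__builtin_expect((x) != 0, 0)

struct sk_pmap;
typedef int (*sk_enter_fn)(struct sk_pmap *, uint64_t va, uint64_t pa);

/* Hypothetical pmap: a NULL callback means "native". */
struct sk_pmap {
	sk_enter_fn enter_cb;
	int native_calls;	/* counters only exist to observe dispatch */
	int ept_calls;
};

/* EPT flavor of the operation; a real one would write EPT entries. */
static int
sk_ept_enter(struct sk_pmap *pmap, uint64_t va, uint64_t pa)
{
	(void)va;
	(void)pa;
	pmap->ept_calls++;
	return 0;
}

/* Generic entry point: native is the fast path, EPT the unlikely one. */
static int
sk_pmap_enter(struct sk_pmap *pmap, uint64_t va, uint64_t pa)
{
	if (sk_predict_false(pmap->enter_cb != NULL))
		return pmap->enter_cb(pmap, va, pa);

	(void)va;
	(void)pa;
	pmap->native_calls++;
	return 0;
}

/* Turn a freshly created native pmap into an EPT one. */
static void
sk_ept_transform(struct sk_pmap *pmap)
{
	pmap->enter_cb = sk_ept_enter;
}
```

A pmap that was never transformed only pays for one well-predicted NULL
check per call, which is the sense in which the native path keeps
priority.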

diffstat:

 sys/arch/x86/include/pmap.h |    4 +-
 sys/arch/x86/x86/pmap.c     |  874 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 875 insertions(+), 3 deletions(-)

diffs (truncated from 910 to 300 lines):

diff -r 72656ba92610 -r d274a003b048 sys/arch/x86/include/pmap.h
--- a/sys/arch/x86/include/pmap.h       Wed Feb 13 07:55:33 2019 +0000
+++ b/sys/arch/x86/include/pmap.h       Wed Feb 13 08:38:25 2019 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: pmap.h,v 1.96 2019/02/11 14:59:32 cherry Exp $ */
+/*     $NetBSD: pmap.h,v 1.97 2019/02/13 08:38:25 maxv Exp $   */
 
 /*
  * Copyright (c) 1997 Charles D. Cranor and Washington University.
@@ -370,6 +370,8 @@
 
 bool           pmap_is_curpmap(struct pmap *);
 
+void           pmap_ept_transform(struct pmap *);
+
 #ifndef __HAVE_DIRECT_MAP
 void           pmap_vpage_cpu_init(struct cpu_info *);
 #endif
diff -r 72656ba92610 -r d274a003b048 sys/arch/x86/x86/pmap.c
--- a/sys/arch/x86/x86/pmap.c   Wed Feb 13 07:55:33 2019 +0000
+++ b/sys/arch/x86/x86/pmap.c   Wed Feb 13 08:38:25 2019 +0000
@@ -1,4 +1,4 @@
-/*     $NetBSD: pmap.c,v 1.321 2019/02/11 14:59:33 cherry Exp $        */
+/*     $NetBSD: pmap.c,v 1.322 2019/02/13 08:38:25 maxv Exp $  */
 
 /*
  * Copyright (c) 2008, 2010, 2016, 2017 The NetBSD Foundation, Inc.
@@ -130,7 +130,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: pmap.c,v 1.321 2019/02/11 14:59:33 cherry Exp $");
+__KERNEL_RCSID(0, "$NetBSD: pmap.c,v 1.322 2019/02/13 08:38:25 maxv Exp $");
 
 #include "opt_user_ldt.h"
 #include "opt_lockdebug.h"
@@ -4797,3 +4797,873 @@
 
        return pflag;
 }
+
+#if defined(__HAVE_DIRECT_MAP) && defined(__x86_64__) && !defined(XEN)
+
+/*
+ * -----------------------------------------------------------------------------
+ * *****************************************************************************
+ * *****************************************************************************
+ * *****************************************************************************
+ * *****************************************************************************
+ * **************** HERE BEGINS THE EPT CODE, USED BY INTEL-VMX ****************
+ * *****************************************************************************
+ * *****************************************************************************
+ * *****************************************************************************
+ * *****************************************************************************
+ * -----------------------------------------------------------------------------
+ *
+ * These functions are invoked as callbacks from the code above. Contrary to
+ * native, EPT does not have a recursive slot; therefore, it is not possible
+ * to call pmap_map_ptes(). Instead, we use the direct map and walk down the
+ * tree manually.
+ *
+ * Apart from that, the logic is mostly the same as native. Once a pmap has
+ * been created, NVMM calls pmap_ept_transform() to make it an EPT pmap.
+ * After that we're good, and the callbacks will handle the translations
+ * for us.
+ *
+ * -----------------------------------------------------------------------------
+ */
+
+/* Hardware bits. */
+#define EPT_R          __BIT(0)        /* read */
+#define EPT_W          __BIT(1)        /* write */
+#define EPT_X          __BIT(2)        /* execute */
+#define EPT_T          __BITS(5,3)     /* type */
+#define                TYPE_UC 0
+#define                TYPE_WC 1
+#define                TYPE_WT 4
+#define                TYPE_WP 5
+#define                TYPE_WB 6
+#define EPT_NOPAT      __BIT(6)
+#define EPT_L          __BIT(7)        /* large */
+#define EPT_A          __BIT(8)        /* accessed */
+#define EPT_D          __BIT(9)        /* dirty */
+/* Software bits. */
+#define EPT_PVLIST     __BIT(60)
+#define EPT_WIRED      __BIT(61)
+
+#define pmap_ept_valid_entry(pte)      (pte & EPT_R)
+
+static inline void
+pmap_ept_stats_update_bypte(struct pmap *pmap, pt_entry_t npte, pt_entry_t opte)
+{
+       int resid_diff = ((npte & EPT_R) ? 1 : 0) - ((opte & EPT_R) ? 1 : 0);
+       int wired_diff = ((npte & EPT_WIRED) ? 1 : 0) - ((opte & EPT_WIRED) ? 1 : 0);
+
+       KASSERT((npte & (EPT_R | EPT_WIRED)) != EPT_WIRED);
+       KASSERT((opte & (EPT_R | EPT_WIRED)) != EPT_WIRED);
+
+       pmap_stats_update(pmap, resid_diff, wired_diff);
+}
+
+static pt_entry_t
+pmap_ept_type(u_int flags)
+{
+       u_int cacheflags = (flags & PMAP_CACHE_MASK);
+       pt_entry_t ret;
+
+       switch (cacheflags) {
+       case PMAP_NOCACHE:
+       case PMAP_NOCACHE_OVR:
+               ret = __SHIFTIN(TYPE_UC, EPT_T);
+               break;
+       case PMAP_WRITE_COMBINE:
+               ret = __SHIFTIN(TYPE_WC, EPT_T);
+               break;
+       case PMAP_WRITE_BACK:
+       default:
+               ret = __SHIFTIN(TYPE_WB, EPT_T);
+               break;
+       }
+
+       ret |= EPT_NOPAT;
+       return ret;
+}
+
+static inline pt_entry_t
+pmap_ept_prot(vm_prot_t prot)
+{
+       pt_entry_t res = 0;
+
+       if (prot & VM_PROT_READ)
+               res |= EPT_R;
+       if (prot & VM_PROT_WRITE)
+               res |= EPT_W;
+       if (prot & VM_PROT_EXECUTE)
+               res |= EPT_X;
+
+       return res;
+}
+
+static inline uint8_t
+pmap_ept_to_pp_attrs(pt_entry_t ept)
+{
+       uint8_t ret = 0;
+       if (ept & EPT_D)
+               ret |= PP_ATTRS_M;
+       if (ept & EPT_A)
+               ret |= PP_ATTRS_U;
+       if (ept & EPT_W)
+               ret |= PP_ATTRS_W;
+       return ret;
+}
+
+static inline pt_entry_t
+pmap_pp_attrs_to_ept(uint8_t attrs)
+{
+       pt_entry_t ept = 0;
+       if (attrs & PP_ATTRS_M)
+               ept |= EPT_D;
+       if (attrs & PP_ATTRS_U)
+               ept |= EPT_A;
+       if (attrs & PP_ATTRS_W)
+               ept |= EPT_W;
+       return ept;
+}
+
+/*
+ * Helper for pmap_ept_free_ptp.
+ * tree[0] = &L2[L2idx]
+ * tree[1] = &L3[L3idx]
+ * tree[2] = &L4[L4idx]
+ */
+static void
+pmap_ept_get_tree(struct pmap *pmap, vaddr_t va, pd_entry_t **tree)
+{
+       pt_entry_t *pteva;
+       paddr_t ptepa;
+       int i, index;
+
+       ptepa = pmap->pm_pdirpa[0];
+       for (i = PTP_LEVELS; i > 1; i--) {
+               index = pl_pi(va, i);
+               pteva = (pt_entry_t *)PMAP_DIRECT_MAP(ptepa);
+               KASSERT(pmap_ept_valid_entry(pteva[index]));
+               tree[i - 2] = &pteva[index];
+               ptepa = pmap_pte2pa(pteva[index]);
+       }
+}
+
+static void
+pmap_ept_free_ptp(struct pmap *pmap, struct vm_page *ptp, vaddr_t va)
+{
+       pd_entry_t *tree[3];
+       int level;
+
+       KASSERT(pmap != pmap_kernel());
+       KASSERT(mutex_owned(pmap->pm_lock));
+       KASSERT(kpreempt_disabled());
+
+       pmap_ept_get_tree(pmap, va, tree);
+
+       level = 1;
+       do {
+               (void)pmap_pte_testset(tree[level - 1], 0);
+
+               pmap_freepage(pmap, ptp, level);
+               if (level < PTP_LEVELS - 1) {
+                       ptp = pmap_find_ptp(pmap, va, (paddr_t)-1, level + 1);
+                       ptp->wire_count--;
+                       if (ptp->wire_count > 1)
+                               break;
+               }
+       } while (++level < PTP_LEVELS);
+       pmap_pte_flush();
+}
+
+/* Allocate L4->L3->L2. Return L2. */
+static struct vm_page *
+pmap_ept_get_ptp(struct pmap *pmap, vaddr_t va, int flags)
+{
+       struct vm_page *ptp;
+       struct {
+               struct vm_page *pg;
+               bool new;
+       } pt[PTP_LEVELS + 1];
+       int i, aflags;
+       unsigned long index;
+       pd_entry_t *pteva;
+       paddr_t ptepa;
+       struct uvm_object *obj;
+       voff_t off;
+
+       KASSERT(pmap != pmap_kernel());
+       KASSERT(mutex_owned(pmap->pm_lock));
+       KASSERT(kpreempt_disabled());
+
+       memset(pt, 0, sizeof(pt));
+       aflags = ((flags & PMAP_CANFAIL) ? 0 : UVM_PGA_USERESERVE) |
+           UVM_PGA_ZERO;
+
+       /*
+        * Loop through all page table levels allocating a page
+        * for any level where we don't already have one.
+        */
+       for (i = PTP_LEVELS; i > 1; i--) {
+               obj = &pmap->pm_obj[i - 2];
+               off = ptp_va2o(va, i - 1);
+
+               PMAP_SUBOBJ_LOCK(pmap, i - 2);
+               pt[i].pg = uvm_pagelookup(obj, off);
+               if (pt[i].pg == NULL) {
+                       pt[i].pg = uvm_pagealloc(obj, off, NULL, aflags);
+                       pt[i].new = true;
+               }
+               PMAP_SUBOBJ_UNLOCK(pmap, i - 2);
+
+               if (pt[i].pg == NULL)
+                       goto fail;
+       }
+
+       /*
+        * Now that we have all the pages looked up or allocated,
+        * loop through again installing any new ones into the tree.
+        */
+       ptepa = pmap->pm_pdirpa[0];
+       for (i = PTP_LEVELS; i > 1; i--) {
+               index = pl_pi(va, i);
+               pteva = (pt_entry_t *)PMAP_DIRECT_MAP(ptepa);
+
+               if (pmap_ept_valid_entry(pteva[index])) {
+                       KASSERT(!pt[i].new);
+                       ptepa = pmap_pte2pa(pteva[index]);
+                       continue;
+               }
+
+               ptp = pt[i].pg;
+               ptp->flags &= ~PG_BUSY; /* never busy */
+               ptp->wire_count = 1;
+               pmap->pm_ptphint[i - 2] = ptp;
+               ptepa = VM_PAGE_TO_PHYS(ptp);
+               pmap_pte_set(&pteva[index], ptepa | EPT_R | EPT_W | EPT_X);
+
+               pmap_pte_flush();
+               pmap_stats_update(pmap, 1, 0);
+
+               /*
+                * If we're not in the top level, increase the
+                * wire count of the parent page.
+                */
+               if (i < PTP_LEVELS) {
+                       pt[i + 1].pg->wire_count++;
+               }
+       }
+       ptp = pt[2].pg;
+       KASSERT(ptp != NULL);
+       pmap->pm_ptphint[0] = ptp;
+       return ptp;
+
+       /*
+        * Allocation of a PTP failed, free any others that we just allocated.


