Age | Commit message (Collapse) | Author |
|
When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.
Setting the list outside of mmu_lock is safe, as concurrent readers must
hold mmu_lock in some capacity, shadow pages can only be added (or removed)
from the list when mmu_lock is held for write, and tasks that are creating
a shadow root are serialized by slots_arch_lock. I.e. it's impossible for
the list to become non-empty until all readers go away, and so readers are
guaranteed to see an empty list even if they make multiple calls to
kvm_get_mmu_page_hash() in a single mmu_lock critical section.
Use smp_store_release() and smp_load_acquire() to access the hash table
pointer to ensure the stores to zero the lists are retired before readers
start to walk the list. E.g. if the compiler hoisted the store before the
zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale
kernel data.
Cc: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation. Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.
Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.
Cc: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Only let userspace pass the same addresses that were used in KVM_SET_USER_MEMORY_REGION
(or KVM_SET_USER_MEMORY_REGION2); gpas in the the upper half of the address space
are an implementation detail of TDX and KVM.
Extracted from a patch by Sean Christopherson <seanjc@google.com>.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bug[*] reported for TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU.
It turns out that @gpa passed to kvm_mmu_do_page_fault() doesn't have
shared bit set when the memory attribute of it is shared, and it leads
to wrong root in tdp_mmu_get_root_for_fault().
Fix it by embedding the direct bits in the gpa that is passed to
kvm_tdp_map_page(), when the memory of the gpa is not private.
[*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250611001018.2179964-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 MMU changes for 6.16:
- Refine and harden handling of spurious faults.
- Use kvm_x86_call() instead of open coding static_call().
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.16
1. Don't flush tlb if HW PTW supported.
2. Add LoongArch KVM selftests support.
|
|
When changing memory attributes on a subset of a potential hugepage, add
the hugepage to the invalidation range tracking to prevent installing a
hugepage until the attributes are fully updated. Like the actual hugepage
tracking updates in kvm_arch_post_set_memory_attributes(), process only
the head and tail pages, as any potential hugepages that are entirely
covered by the range will already be tracked.
Note, only hugepage chunks whose current attributes are NOT mixed need to
be added to the invalidation set, as mixed attributes already prevent
installing a hugepage, and it's perfectly safe to install a smaller
mapping for a gfn whose attributes aren't changing.
Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: stable@vger.kernel.org
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20250430220954.522672-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Check request KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to free obsolete roots in
kvm_mmu_reload() to prevent kvm_mmu_reload() from seeing a stale obsolete
root.
Since kvm_mmu_reload() can be called outside the
vcpu_enter_guest() path (e.g., kvm_arch_vcpu_pre_fault_memory()), it may be
invoked after a root has been marked obsolete and before vcpu_enter_guest()
is invoked to process KVM_REQ_MMU_FREE_OBSOLETE_ROOTS and set root.hpa to
invalid. This causes kvm_mmu_reload() to fail to load a new root, which
can lead to kvm_arch_vcpu_pre_fault_memory() being stuck in the while
loop in kvm_tdp_map_page() since RET_PF_RETRY is always returned due to
is_page_fault_stale().
Keep the existing check of KVM_REQ_MMU_FREE_OBSOLETE_ROOTS in
vcpu_enter_guest() since the cost of kvm_check_request() is negligible,
especially a check that's guarded by kvm_request_pending().
Export symbol of kvm_mmu_free_obsolete_roots() as kvm_mmu_reload() is
inline and may be called outside of kvm.ko.
Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013333.5817-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Warn if PFN changes on shadow-present SPTE in mmu_set_spte().
KVM should _never_ change the PFN of a shadow-present SPTE. In
mmu_set_spte(), there is a WARN_ON_ONCE() on pfn changes on shadow-present
SPTE in mmu_spte_update() to detect this condition. However, that
WARN_ON_ONCE() is not hittable since mmu_set_spte() invokes drop_spte()
earlier before mmu_spte_update(), which clears SPTE to a !shadow-present
state. So, before invoking drop_spte(), add a WARN_ON_ONCE() in
mmu_set_spte() to warn PFN change of a shadow-present SPTE.
For the spurious prefetch fault, only return RET_PF_SPURIOUS directly when
PFN is not changed. When PFN changes, fall through to follow the sequence
of drop_spte(), warn of PFN change, make_spte(), flush tlb, rmap_add().
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013310.5781-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Instead of simply treating a prefetch fault as spurious when there's a
shadow-present old SPTE, further check if the old SPTE is leaf to determine
if a prefetch fault is spurious.
It's not reasonable to treat a prefetch fault as spurious when there's a
shadow-present non-leaf SPTE without a corresponding shadow-present leaf
SPTE. e.g., in the following sequence, a prefetch fault should not be
considered spurious:
1. add a memslot with size 4K
2. prefault GPA A in the memslot
3. delete the memslot (zap all disabled)
4. re-add the memslot with size 2M
5. prefault GPA A again.
In step 5, the prefetch fault attempts to install a 2M huge entry.
Since step 3 zaps the leaf SPTE for GPA A while keeping the non-leaf SPTE,
the leaf entry will remain empty after step 5 if the fetch fault is
regarded as spurious due to a shadow-present non-leaf SPTE.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20250318013111.5648-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate
merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit 7911f145de5f,
"x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
- Loose ends: handling of the remaining exits from the TDX module, including
EPT violation/misconfig and several TDVMCALL leaves that are handled in
the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
an error or ignoring operations that are not supported by TDX guests
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86/mmu changes for 6.15
Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
"fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple
aging actions to run in parallel, and more importantly avoids stalling vCPUs,
e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
in memory.
For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
KVM doesn't need to access any metadata to age SPTEs.
For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn,
acquire the rmap's spinlock with read-only permissions, which allows hardening
and optimizing the locking and aging, e.g. locking an rmap for write requires
mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple
concurrent readers aren't supported.
To avoid forcing all SPTE updates to use atomic operations (clearing the
Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
the Accessed bit. KVM (and mm/) already tolerates false positives/negatives
for Accessed information, and all testing has shown that reducing the latency
of aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
|
|
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.
On AMD platforms, KVM always honors guest PAT. On Intel however there are
two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forces WB previously. See [2].
So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.
The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
no-self-snoop cases, as well as to remove the shadow_memtype_mask check
from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be ther PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding has no flag KVM_MEM_GUEST_MEMFD.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and down to its callers:
kvm_dirty_ring_get_rsvd_entries(), kvm_dirty_ring_alloc().
This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.
No function changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In future changes coco specific code will need to call kvm_tdp_map_page()
from within their respective gmem_post_populate() callbacks. Export it
so this can be done from vendor specific code. Since kvm_mmu_reload()
will be needed for this operation, export its callee kvm_mmu_load() as
well.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073827.22270-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
fatal errors have occurred.
kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
regardless of the specific error code from tdp_mmu_set_spte_atomic(),
tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
in general cases where the only possible error code from these functions is
-EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
errors.
Since this -EIO error is also a fatal error, check for VM dead in the
kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
The error -EIO is uncommon and has not been observed in real workloads.
Currently, it is only hypothetically triggered by bypassing the real
SEAMCALL and faking an error in the SEAMCALL wrapper.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250220102728.24546-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disable TDX support when TDP MMU or mmio caching or EPT A/D bits aren't
supported.
As TDP MMU is becoming main stream than the legacy MMU, the legacy MMU
support for TDX isn't implemented.
TDX requires KVM mmio caching. Without mmio caching, KVM will go to MMIO
emulation without installing SPTEs for MMIOs. However, TDX guest is
protected and KVM would meet errors when trying to emulate MMIOs for TDX
guest during instruction decoding. So, TDX guest relies on SPTEs being
installed for MMIOs, which are with no RWX bits and with VE suppress bit
unset, to inject VE to TDX guest. The TDX guest would then issue TDVMCALL
in the VE handler to perform instruction decoding and have host do MMIO
emulation.
TDX also relies on EPT A/D bits as EPT A/D bits have been supported in all
CPUs since Haswell. Relying on it can avoid RWX bits being masked out in
the mirror page table for prefaulted entries.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
Requested by Sean at [1].
[1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Export a function to walk down the TDP without modifying it and simply
check if a GPA is mapped.
Future changes will support pre-populating TDX private memory. In order to
implement this KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT. [1]
There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.
Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.
kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.
Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073457.22011-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Update attr_filter field to zap both private and shared mappings for TDX
when memslot is deleted.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073426.21997-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and
the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used
to configure the EPT level and GPAW.
Use max_gpa to determine the TDP level.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A VMM may send a non-fatal signal to its threads, including vCPU tasks,
at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU
task receives the signal while its trying to spawn the huge page recovery
vhost task, then KVM_RUN will fail due to copy_process() returning
-ERESTARTNOINTR.
Rework call_once() to mark the call complete if and only if the called
function succeeds, and plumb the function's true error code back to the
call_once() invoker. This provides userspace with the correct, non-fatal
error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows
subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge
page task.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[implemented the kvm user side]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-3-kbusch@meta.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Lets callers distinguish why the vhost task creation failed. No one
currently cares why it failed, so no real runtime change from this
patch, but that will not be the case for long.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-2-kbusch@meta.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert the shadow MMU to use per-rmap locking instead of the per-VM
mmu_lock to protect rmaps when aging SPTEs. When A/D bits are enabled, it
is safe to simply clear the Accessed bits, i.e. KVM just needs to ensure
the parent page table isn't freed.
The less obvious case is marking SPTEs for access tracking in the
non-A/D case (for EPT only). Because aging a gfn means making the SPTE
not-present, KVM needs to play nice with the case where the CPU has TLB
entries for a SPTE that is not-present in memory. For example, when
doing dirty tracking, if KVM encounters a non-present shadow accessed SPTE,
KVM must know to do a TLB invalidation.
Fortunately, KVM already provides (and relies upon) the necessary
functionality. E.g. KVM doesn't flush TLBs when aging pages (even in the
clear_flush_young() case), and when harvesting dirty bitmaps, KVM flushes
based on the dirty bitmaps, not on SPTEs.
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-12-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered. When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock. Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.
When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-11-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock. The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds). In short, a more sophisticated locking scheme shouldn't be
necessary.
Note, the lock only protects the rmap structure itself, SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries)
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate. This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-9-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and
only if the VM has at least one shadow page, as opposed to checking if the
VM has rmaps. Checking for rmaps will effectively yield a false positive
if the VM ran nested TDP VMs in the past, but is not currently doing so.
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reorder the processing of the TDP MMU versus the shadow MMU when aging
SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP
MMU reports that the page is young, i.e. completely avoid taking mmu_lock
if the TDP MMU SPTE is young. Swap the order for the test-and-age helper
as well for consistency.
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots,
and do all SPTE aging with atomic updates; while clobbering Accessed
information is ok, KVM must not corrupt other bits, e.g. must not drop
a Dirty or Writable bit when making a SPTE young..
If updating a SPTE to mark it for access tracking fails, leave it as is
and treat it as if it were young. If the spte is being actively modified,
it is most likely young.
Acquire and release mmu_lock for write when harvesting age information
from the shadow MMU, as the shadow MMU doesn't yet support aging outside
of mmu_lock.
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't force SPTE modifications to be done atomically if the only volatile
bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate
stale aging state, and the probability of an Accessed bit A/D assist being
clobbered *and* affecting again is likely far lower than the probability
of consuming stale information due to not flushing TLBs when aging.
Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better
capture the nature of the helper.
Opportunstically do s/write/update on the TDP MMU wrapper, as it's not
simply the "write" that needs to be done atomically, it's the entire
update, i.e. the entire read-modify-write operation needs to be done
atomically so that KVM has an accurate view of the old SPTE.
Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect,
it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with
kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net
positive, and would require significant churn.
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role. KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.
Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).
And while the role is incorprated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.
Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.
Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When waking a VM's NX huge page recovery thread, ensure the thread is
actually alive before trying to wake it. Now that the thread is spawned
on-demand during KVM_RUN, a VM without a recovery thread is reachable via
the related module params.
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:vhost_task_wake+0x5/0x10
Call Trace:
<TASK>
set_nx_huge_pages+0xcc/0x1e0 [kvm]
param_attr_store+0x8a/0xd0
module_attr_store+0x1a/0x30
kernfs_fop_write_iter+0x12f/0x1e0
vfs_write+0x233/0x3e0
ksys_write+0x60/0xd0
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f3b52710104
</TASK>
Modules linked in: kvm_intel kvm
CR2: 0000000000000040
Fixes: 931656b9e2ff ("kvm: defer huge page recovery vhost task to later")
Cc: stable@vger.kernel.org
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250124234623.3609069-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once;
except it is actually usable, because the callback takes a parameter.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As part of enabling TDX virtual machines, support support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the guest, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and privates. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID)
instead of 1 in kvm_mmu_page_fault().
The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e.,
npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(),
kvm_handle_page_fault()). They either check if the return value is > 0 (as
in npf_interception()) or pass it further to vcpu_run() to decide whether
to break out of the kernel loop and return to the user when r <= 0.
Therefore, returning any positive value is equivalent to returning 1.
Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure
a positive return value.
This is a preparation to allow TDX's EPT violation handler to check the
RET_PF* value and retry internally for RET_PF_RETRY.
No functional changes are intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a few sanity checks to prevent memslot GFNs from ever having alias bits
set.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly though calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot looks ups and zapping
operations will be provided with a GFN without the shared bit set.
If these memslot GFNs ever had the bit that selects between the two aliases
it could lead to unexpected behavior in the complicated code that directs
faulting or zapping operations between the roots that map the two aliases.
As a safety measure, prevent memslots from being set at a GFN range that
contains the alias bit.
Also, check in the kvm_faultin_pfn() for the fault path. This later check
does less today, as the alias bits are specifically stripped from the GFN
being checked, however future code could possibly call in to the fault
handler in a way that skips this stripping. Since kvm_faultin_pfn() now
has many references to vcpu->kvm, extract it to local variable.
Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and make it enum kvm_tdp_mmu_root_types
as an argument.
kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Have the callers only
invalidate the required roots instead of all roots.
Within kvm_tdp_mmu_invalidate_roots(), respect the root type
passed by checking the root type in root iterator.
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-17-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently
and it will make the code confusing when being read from their
perspective. Special operations added for TDX that set private or zap
private memory will have nothing to do with these other private memory
technologies. (SEV, etc).
To handle these, instead teach the TDP MMU about a new concept "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid level page
tables need to be installed. When the memory be mirrored needs to actually
be changed, calls can be made to via x86_ops.
private KVM page fault |
| |
V |
private GPA | CPU protected EPTP
| | |
V | V
mirror PT root | external PT root
| | |
V | V
mirror PT --hook to propagate-->external PT
| | |
\--------------------+------\ |
| | |
| V V
| private guest page
|
|
non-encrypted memory | encrypted memory
|
Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on to mirrored roots.
In order to direct operations to correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that conversion could be a direct assignment with a case. Don't do
this because the mapping of private to mirrored is confusing enough. So it
is worth not hiding the logic in type casting.
Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being cannot be
rebuilt like a normal root. It needs to persist for the lifetime of the VM.
The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().
Don't support 2M page for now. This is avoided by forcing 4k pages in the
fault. Add a KVM_BUG_ON() to verify.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The kvm_tdp_mmu_alloc_root() function currently always returns 0. This
allows for the caller, mmu_alloc_direct_roots(), to call
kvm_tdp_mmu_alloc_root() and also return 0 in one line:
return kvm_tdp_mmu_alloc_root(vcpu);
So it is useful even though the return value of kvm_tdp_mmu_alloc_root()
is always the same. However, in future changes, kvm_tdp_mmu_alloc_root()
will be called twice in mmu_alloc_direct_roots(). This will force the
first call to either awkwardly handle the return value that will always
be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void.
Do it in a separate change so the future change will be cleaner.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240718211230.1492011-7-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them to have
room for a pointer and use a union to avoid memory overhead.
For private GPA, CPU refers to a private page table whose contents are
encrypted. The dedicated APIs to operate on it (e.g. updating/reading its
PTE entry) are used, and their cost is expensive.
When KVM resolves the KVM page fault, it walks the page tables. To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table allocate two sets of page tables for the private
half of the GPA space.
For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do additional operations
necessary for modifying the external page table in future patches.
The relationship of the types of page tables in this scheme is depicted
below:
KVM page fault |
| |
V |
-------------+---------- |
| | |
V V |
shared GPA private GPA |
| | |
V V |
shared PT root mirror PT root | private PT root
| | | |
V V | V
shared PT mirror PT --propagate--> external PT
| | | |
| \-----------------+------\ |
| | | |
V | V V
shared guest page | private guest page
|
non-encrypted memory | encrypted memory
|
PT - Page table
Shared PT - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
updates this table to map private guest pages.
Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses
it to propagate PT change to the actual private PT.
Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add new members to strut kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set them appropriately.
TDX utilizes two GPA aliases for the same memslots, one for memory that is
for private memory and one that is for shared. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.
Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.
There was feedback that target aliases should be specified such that the
default value (0) is to operate on both aliases. Several options were
considered. Several variations of having separate bools defined such
that the default behavior was to process both aliases. They either allowed
nonsensical configurations, or were confusing for the caller. A simple
enum was also explored and was close, but was hard to process in the
caller. Instead, use an enum with the default value (0) reserved as a
disallowed value. Catch ranges that didn't have the target aliases
specified by looking for that specific value.
Set target alias with enum appropriately for these MMU operations:
- For KVM's mmu notifier callbacks, zap shared pages only because private
pages won't have a userspace mapping
- For setting memory attributes, kvm_arch_pre_set_memory_attributes()
chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.
Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Prepare for a future TDX patch which asserts that atomic zapping
(i.e. zapping with mmu_lock taken for read) don't operate on mirror roots.
When tearing down a VM, all roots have to be zapped (including mirror
roots once they're in place) so do that with the mmu_lock taken for write.
kvm_mmu_uninit_tdp_mmu() is invoked either before or after executing any
atomic operations on SPTEs by vCPU threads. Therefore, it will not impact
vCPU threads performance if kvm_tdp_mmu_zap_invalidated_roots() acquires
mmu_lock for write to zap invalid roots.
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.
If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.
Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE. If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.
However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in primary MMU's PTE is toggled and causes a PTE value mismatch. The
WARN was also recently added, specifically to track down scenarios where
KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way. In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.
Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
Reported-by: Lei Yang <leiyang@redhat.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Tested-by: Lei Yang <leiyang@redhat.com>
Link: https://lore.kernel.org/r/20241218213611.3181643-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().
The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.
Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.
However, because of these threads, cgroups which have kvm instances
inside never complete freezing. This can be trivially reproduced:
root@test ~# mkdir /sys/fs/cgroup/test
root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
root@test ~# qemu-system-x86_64 -nographic -enable-kvm
and in another terminal:
root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
root@test ~# cat /sys/fs/cgroup/test/cgroup.events
populated 1
frozen 0
The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.
Since the worker kthread is tied to a user process, it's better if
it behaves similarly to user tasks as much as possible, including
being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all
that kvm_vm_create_worker_thread() wanted to be and more: not only it
inherits the userspace process's cgroups, it has other niceties like
being parented properly in the process tree. Use it instead of the
homegrown alternative.
Incidentally, the new code is also better behaved when you flip recovery
back and forth to disabled and back to enabled. If your recovery period
is 1 minute, it will run the next recovery after 1 minute independent
of how many times you flipped the parameter.
(Commit message based on emails from Tejun).
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
- Quirk KVM's misguided behavior of initialized certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which can lead to save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe, and harden the
cache accessors to try to prevent similar issues from occuring in the
future.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
- Minor cleanups
|
|
Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.
Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|