|
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.
On AMD platforms, KVM always honors guest PAT. On Intel however there are
two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forced WB previously. See [2].
So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.
The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
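For reference, a minimal sketch of how newer userspace would opt out of the
quirk via the standard KVM_ENABLE_CAP path (the VM fd handling and error
reporting here are illustrative):

  #include <linux/kvm.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  /* Ask KVM to honor guest PAT; this fails if the CPU lacks self-snoop. */
  static int disable_ignore_guest_pat(int vm_fd)
  {
      struct kvm_enable_cap cap = {
          .cap = KVM_CAP_DISABLE_QUIRKS2,
          .args[0] = KVM_X86_QUIRK_IGNORE_GUEST_PAT,
      };

      if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap)) {
          perror("KVM_CAP_DISABLE_QUIRKS2");
          return -1;
      }
      return 0;
  }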
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
no-self-snoop cases, as well as to remove the shadow_memtype_mask check
from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT". Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.
Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable. In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.
As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.
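A sketch of that short-circuit; the guest_apic_protected flag and the vendor
hook are taken from this series, so treat the details as illustrative rather
than the exact patch:

  int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
  {
      struct kvm_lapic *apic = vcpu->arch.apic;
      u32 ppr;

      if (!kvm_apic_present(vcpu))
          return -1;

      /*
       * The vAPIC page is owned by the TDX module, so KVM cannot read
       * PPR/IRR and cannot tell _which_ vector is pending; callers that
       * only need to know _if_ an IRQ is pending use the vendor hook
       * (protected_apic_has_interrupt) instead.
       */
      if (apic->guest_apic_protected)
          return -1;

      __apic_update_ppr(apic, &ppr);
      return apic_has_interrupt_for_ppr(apic, ppr);
  }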
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
For the TD exits not due to HLT TDCALL, skip checking RVI pending in
tdx_protected_apic_has_interrupt(). Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked. There are no CPU flows
that modify RVI in the middle of instruction execution. I.e. if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction that caused the TD exit is in an STI/SS shadow. KVM
doesn't care about STI/SS shadows outside of the HALTED case. And if the
interrupt was pending before TD exit, then it _must_ be blocked, otherwise
the interrupt would have been serviced at the instruction boundary.
For the HLT TDCALL case, it will be handled in a future patch when HLT
TDCALL is supported.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
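For reference, the converted definitions look roughly like this (bit
positions assumed for illustration):

  #define KVM_DEBUGREG_BP_ENABLED    BIT(0)
  #define KVM_DEBUGREG_WONT_EXIT     BIT(1)
  #define KVM_DEBUGREG_AUTO_SWITCH   BIT(2)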
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Several MSRs are constant and only used in userspace (ring 3), but VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when the
kernel is about to return to userspace. To eliminate unnecessary wrmsrs, KVM also
caches the value it wrote to an MSR last time.
The TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit. This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values. kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr. So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.
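A sketch of what such a cache-only variant can look like, mirroring the
per-CPU bookkeeping described above (function and field names are assumptions
for illustration):

  /*
   * Hardware already holds @value (e.g. the TDX module reset the MSR on TD
   * exit), so only refresh the cache and make sure the user-return notifier
   * is registered so that kvm_on_user_return() restores the host value.
   */
  void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
  {
      struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);

      msrs->values[slot].curr = value;    /* no wrmsr, hardware is already in sync */
      if (!msrs->registered) {
          msrs->urn.on_user_return = kvm_on_user_return;
          user_return_notifier_register(&msrs->urn);
          msrs->registered = true;
      }
  }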
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding memslot does not have the KVM_MEM_GUEST_MEMFD flag.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A TD guest vCPU needs TDX-specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.
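A sketch of the userspace side, assuming the kvm_tdx_cmd layout used by this
series (the field usage shown is illustrative):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Issue the vCPU-scoped KVM_TDX_INIT_VCPU sub-command of KVM_MEMORY_ENCRYPT_OP. */
  static int tdx_init_vcpu(int vcpu_fd, __u64 vcpu_rcx)
  {
      struct kvm_tdx_cmd cmd = {
          .id = KVM_TDX_INIT_VCPU,
          .data = vcpu_rcx,    /* opaque value handed to the TDX module */
      };

      return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }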
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
(Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Implement management of the TDX private KeyID and the create, destroy, and
free operations for a TDX guest.
When creating a TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).
On destruction, free the allocated pages, and the KeyID.
Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops
hook. The TDR control structures can be freed in the vm_destroy() hook,
which runs last.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix build issue in kvm-coco-queue
- Init ret earlier to fix __tdx_td_init() error handling. (Chao)
- Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit
changing TSC offset/multiplier when guest_tsc_protected is true.
X86 confidential computing technology defines protected guest TSC so that
the VMM can't change the TSC offset/multiplier once the vCPU is initialized.
SEV-SNP defines Secure TSC as optional, whereas TDX mandates it.
KVM has common logic on x86 that tries to guess or adjust TSC
offset/multiplier for better guest TSC and TSC interrupt latency
at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration
over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes
(kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR
(kvm_set_msr_common()).
The current x86 KVM implementation conflicts with protected TSC because the
VMM can't change the TSC offset/multiplier.
Because KVM emulates the TSC timer or the TSC deadline timer with the TSC
offset/multiplier, the TSC timer interrupt is injected into the guest at the
wrong time if the KVM TSC offset is different from what the TDX module
determined.
Originally this issue was found by the cyclictest of rt-tests [1], as the
latency in the TDX case was worse than the VMX value plus the TDX SEAMCALL
overhead. It turned out that the KVM TSC offset was different from what the
TDX module determines.
Disable or ignore the KVM logic that changes/adjusts the TSC offset/multiplier
when guest_tsc_protected is true, thus keeping the KVM TSC offset/multiplier
the same as the value determined by the TDX module. Writes to MSR_IA32_TSC are
also blocked as
they amount to a change in the TSC offset.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
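A minimal sketch of the kind of guard this implies, assuming the new
guest_tsc_protected member; the surrounding function is abridged and
illustrative, not the exact patch:

  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
  {
      /* The TDX module owns the offset/multiplier; leave it untouched. */
      if (vcpu->arch.guest_tsc_protected)
          return;

      vcpu->arch.l1_tsc_offset = l1_offset;
      /* ... nested offset math and the vendor .write_tsc_offset() call ... */
  }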
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in
common x86, so that SVM can also use the snapshot.
Opportunistically change the field to a u64. While bits 63:32 are reserved
on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned
long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Open code the filling of vcpu->arch.exception in kvm_requeue_exception()
instead of bouncing through kvm_multiple_exception(), as re-injection
doesn't actually share that much code with "normal" injection, e.g. the
VM-Exit interception check, payload delivery, and nested exception code
is all bypassed as those flows only apply during initial injection.
When FRED comes along, the special casing will only get worse, as FRED
explicitly tracks nested exceptions and essentially delivers the payload
on the stack frame, i.e. re-injection will need more inputs, and normal
injection will have yet more code that needs to be bypassed when KVM is
re-injecting an exception.
No functional change intended.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename send_user_only to avoid "user", because KVM's ABI is to not inject
page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert
the polarity to keep the naming simple and unambiguous. E.g. while KVM
often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and
"send_kernel" could be misconstrued as "send only to kernel".
Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that all KVM usage of the Xen HVM config information is buried behind
CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch
and into kvm_xen.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that all references to kvm_vcpu_arch.xen_hvm_config are wrapped with
CONFIG_KVM_XEN #ifdefs, bury the field itself behind CONFIG_KVM_XEN=y.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250215011437.1203084-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock. The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds). In short, a more sophisticated locking scheme shouldn't be
necessary.
Note, the lock only protects the rmap structure itself; SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries).
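A sketch of the bit-spinlock idea on the rmap head value (the macro name,
lock bit, and helpers are assumptions; the upstream code may differ in
detail):

  #define KVM_RMAP_LOCKED    BIT(1)

  /* Spin until the lock bit is clear, then claim it with an acquire cmpxchg. */
  static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
  {
      unsigned long old_val, new_val;

      old_val = READ_ONCE(rmap_head->val);
      do {
          while (old_val & KVM_RMAP_LOCKED) {
              cpu_relax();
              old_val = READ_ONCE(rmap_head->val);
          }
          new_val = old_val | KVM_RMAP_LOCKED;
      } while (!try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));

      return old_val;
  }

  /* Publish the (possibly rewritten) rmap value with release semantics, dropping the lock. */
  static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, unsigned long new_val)
  {
      smp_store_release(&rmap_head->val, new_val & ~KVM_RMAP_LOCKED);
  }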
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Walk the TDP MMU in an RCU read-side critical section without holding
mmu_lock when harvesting and potentially updating age information on
TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots,
and do all SPTE aging with atomic updates; while clobbering Accessed
information is ok, KVM must not corrupt other bits, e.g. must not drop
a Dirty or Writable bit when making a SPTE young.
If updating a SPTE to mark it for access tracking fails, leave it as is
and treat it as if it were young. If the spte is being actively modified,
it is most likely young.
Acquire and release mmu_lock for write when harvesting age information
from the shadow MMU, as the shadow MMU doesn't yet support aging outside
of mmu_lock.
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the per-vCPU "cache" of the reference pvclock and instead cache
only the TSC shift+multiplier. All other fields in pvclock are fully
recomputed by kvm_guest_time_update(), i.e. aren't actually persisted.
In addition to shaving a few bytes, explicitly tracking the TSC shift/mul
fields makes it easier to see that those fields are tied to hw_tsc_khz
(they exist to avoid having to do expensive math in the common case).
And conversely, not tracking the other fields makes it easier to see that
things like the version number are pulled from the guest's copy, not from
KVM's reference.
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries. Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).
Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:
SKX 6850 => 6450
ICX 9000 => 8800
EMR 7900 => 7700
Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.
Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case. However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex. Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would requiring detecting changes
to L1 state. KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit. The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.
To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic CPUID
feature.
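A purely illustrative sketch of such a compile-time guard; the helper
__guest_cpu_cap_has() and the exact set of disallowed feature bits are
assumptions, not the upstream definition:

  #define guest_cpu_cap_has(vcpu, x86_feature)                            \
  ({                                                                      \
      /* Runtime-updated bits may be stale while updates are deferred. */ \
      BUILD_BUG_ON((x86_feature) == X86_FEATURE_OSXSAVE ||                \
                   (x86_feature) == X86_FEATURE_OSPKE);                   \
      __guest_cpu_cap_has(vcpu, x86_feature);                             \
  })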
Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.
Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug. The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled. E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT. KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.
Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0
CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0
L0's view:
==========
L2 reads DR6, arch.dr6 = 0
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the "iommu_domain" and "iommu_noncoherent" fields from struct
kvm_arch, which are no longer used since commit ad6260da1e23 ("KVM: x86:
drop legacy device assignment").
Signed-off-by: Ted Chen <znscnchen@gmail.com>
Link: https://lore.kernel.org/r/20250124075055.97158-1-znscnchen@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pull kvm updates from Paolo Bonzini:
"Loongarch:
- Clear LLBCTL if secondary mmu mapping changes
- Add hypercall service support for usermode VMM
x86:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM
performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
enabled
- Ensure that all SEV code is compiled out when disabled in Kconfig,
even if building with less brilliant compilers
- Remove a redundant TLB flush on AMD processors when guest CR4.PGE
changes
- Use str_enabled_disabled() to replace open coded strings
- Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
APICv cache prior to every VM-Enter
- Overhaul KVM's CPUID feature infrastructure to track all vCPU
capabilities instead of just those where KVM needs to manage state
and/or explicitly enable the feature in hardware. Along the way,
refactor the code to make it easier to add features, and to make it
more self-documenting how KVM is handling each feature
- Rework KVM's handling of VM-Exits during event vectoring; this
plugs holes where KVM unintentionally puts the vCPU into infinite
loops in some scenarios (e.g. if emulation is triggered by the
exit), and brings parity between VMX and SVM
- Add pending request and interrupt injection information to the
kvm_exit and kvm_entry tracepoints respectively
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU
when loading guest/host PKRU, due to a refactoring of the kernel
helpers that didn't account for KVM's pre-checking of the need to
do WRPKRU
- Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.
Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
went to userspace, and all the others did not; the new code need
not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not
- As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.
When TDX will be enabled, operations on private pages will need to
go through the privileged TDX Module via SEAMCALLs; as a result,
they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the
private EPT in host memory, and define entries in kvm_x86_ops to
operate on external page tables such as the TDX private EPT
- The recently introduced conversion of the NX-page reclamation
kthread to vhost_task moved the task under the main process. The
task is created as soon as KVM_CREATE_VM was invoked and this, of
course, broke userspace that didn't expect to see any child task of
the VM process until it started creating its own userspace threads.
In particular crosvm refuses to fork() if procfs shows any child
task, so unbreak it by creating the task lazily. This is arguably a
userspace bug, as there can be other kinds of legitimate worker
tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they
show as "Kthread: 1" in proc/.../status?
x86 - Intel:
- Fix a bug where KVM updates hardware's APICv cache of the highest
ISR bit while L2 is active, which ultimately results in a
hardware-accelerated L1 EOI effectively being lost
- Honor event priority when emulating Posted Interrupt delivery
during nested VM-Enter by queueing KVM_REQ_EVENT instead of
immediately handling the interrupt
- Rework KVM's processing of the Page-Modification Logging buffer to
reap entries in the same order they were created, i.e. to mark gfns
dirty in the same order that hardware marked the page/PTE dirty
- Misc cleanups
Generic:
- Cleanup and harden kvm_set_memory_region(); add proper lockdep
assertions when setting memory regions and add a dedicated API for
setting KVM-internal memory regions. The API can then explicitly
disallow all flags for KVM-internal memory regions
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to
fix a bug where KVM would return a pointer to a vCPU prior to it
being fully online, and give kvm_for_each_vcpu() similar treatment
to fix a similar flaw
- Wait for a vCPU to come online prior to executing a vCPU ioctl, to
fix a bug where userspace could coerce KVM into handling the ioctl
on a vCPU that isn't yet onlined
- Gracefully handle xarray insertion failures; even though such
failures are impossible in practice after xa_reserve(), reserving
an entry is always followed by xa_store() which does not know (or
differentiate) whether there was an xa_reserve() before or not
RISC-V:
- Zabha, Svvptc, and Ziccrse extension support for guests. None of
them require anything in KVM except for detecting them and marking
them as supported; Zabha adds byte and halfword atomic operations,
while the others are markers for specific operation of the TLB and
of LL/SC instructions respectively
- Virtualize SBI system suspend extension for Guest/VM
- Support firmware counters which can be used by the guests to
collect statistics about traps that occur in the host
Selftests:
- Rework vcpu_get_reg() to return a value instead of using an
out-param, and update all affected arch code accordingly
- Convert the max_guest_memory_test into a more generic
mmu_stress_test. The basic gist of the "conversion" is to have the
test do mprotect() on guest memory while vCPUs are accessing said
memory, e.g. to verify KVM and mmu_notifiers are working as
intended
- Play nice with treewide builds of unsupported architectures, e.g.
arm (32-bit), as KVM selftests' Makefile doesn't do anything to
ensure the target architecture is actually one KVM selftests
supports
- Use the kernel's $(ARCH) definition instead of the target triple
for arch specific directories, e.g. arm64 instead of aarch64,
mainly so as not to be different from the rest of the kernel
- Ensure that format strings for logging statements are checked by
the compiler even when the logging statement itself is disabled
- Attempt to whack the last LLC references/misses mole in the Intel
PMU counters test by adding a data load and doing CLFLUSH{OPT} on
the data instead of the code being executed. It seems that modern
Intel CPUs have learned new code prefetching tricks that bypass the
PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that
events are counting correctly without actually knowing what the
events count given the underlying hardware; this can happen if
Intel reuses a formerly microarchitecture-specific event encoding
as an architectural event, as was the case for Top-Down Slots"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
kvm: defer huge page recovery vhost task to later
KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
KVM: Disallow all flags for KVM-internal memslots
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
KVM: Add a dedicated API for setting KVM-internal memslots
KVM: Assert slots_lock is held when setting memory regions
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
LoongArch: KVM: Add hypercall service support for usermode VMM
LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
KVM: VMX: read the PML log in the same order as it was written
KVM: VMX: refactor PML terminology
KVM: VMX: Fix comment of handle_vmx_instruction()
KVM: VMX: Reinstate __exit attribute for vmx_exit()
KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
KVM: x86: Use LVT_TIMER instead of an open coded literal
RISC-V: KVM: Add new exit statstics for redirected traps
RISC-V: KVM: Update firmware counters for various events
RISC-V: KVM: Redirect instruction access fault trap to guest
...
|
|
Some libraries want to ensure they are single threaded before forking,
so making the kernel's kvm huge page recovery process a vhost task of
the user process breaks those. The minijail library used by crosvm is
one such affected application.
Defer the task to after the first VM_RUN call, which occurs after the
parent process has forked all its jailed processes. This needs to happen
only once for the kvm instance, so introduce some general-purpose
infrastructure for that, too. It's similar in concept to pthread_once;
except it is actually usable, because the callback takes a parameter.
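A sketch of the general shape of such a helper (the struct and function names
follow the call_once idea introduced here, but the details are illustrative):

  struct once {
      atomic_t state;        /* 0 = not started, 1 = running, 2 = completed */
      struct mutex lock;
  };

  static inline void call_once(struct once *once, void (*cb)(struct once *))
  {
      /* Fast path; pairs with the release store below. */
      if (atomic_read_acquire(&once->state) == 2)
          return;

      mutex_lock(&once->lock);
      if (atomic_read(&once->state) == 0) {
          atomic_set(&once->state, 1);
          cb(once);    /* unlike pthread_once(), the callback takes an argument */
          atomic_set_release(&once->state, 2);
      }
      mutex_unlock(&once->lock);
  }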
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250123153543.2769928-1-kbusch@meta.com>
[Move call_once API to include/linux. - Paolo]
Cc: stable@vger.kernel.org
Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As part of enabling TDX virtual machines, support separation of
private/shared EPT into separate roots.
Confidential computing solutions almost invariably have concepts of
private and shared memory, but they may different a lot in the details.
In SEV, for example, the bit is handled more like a permission bit as
far as the page tables are concerned: the private/shared bit is not
included in the physical address.
For TDX, instead, the bit is more like a physical address bit, with
the host mapping private memory in one half of the address space and
shared in another. Furthermore, the two halves are mapped by different
EPT roots and only the shared half is managed by KVM; the private half
(also called Secure EPT in Intel documentation) gets managed by the
privileged TDX Module via SEAMCALLs.
As a result, the operations that actually change the private half of
the EPT are limited and relatively slow compared to reading a PTE. For
this reason the design for KVM is to keep a mirror of the private EPT in
host memory. This allows KVM to quickly walk the EPT and only perform the
slower private EPT operations when it needs to actually modify mid-level
private PTEs.
There are thus three sets of EPT page tables: external, mirror and
direct. In the case of TDX (the only user of this framework) the
first two cover private memory, whereas the third manages shared
memory:
external EPT - Hidden within the TDX module, modified via TDX module
calls.
mirror EPT - Bookkeeping tree used as an optimization by KVM, not
used by the processor.
direct EPT - Normal EPT that maps unencrypted shared memory.
Managed like the EPT of a normal VM.
Modifying external EPT
----------------------
Modifications to the mirrored page tables need to also perform the
same operations to the private page tables, which will be handled via
kvm_x86_ops. Although this prep series does not interact with the TDX
module at all to actually configure the private EPT, it does lay the
ground work for doing this.
In some ways updating the private EPT is as simple as plumbing PTE
modifications through to also call into the TDX module; however, the
locking is more complicated because inserting a single PTE cannot anymore
be done atomically with a single CMPXCHG. For this reason, the existing
FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the
private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides
protecting operation of KVM, it limits the set of cases in which the
TDX module will encounter contention on its own PTE locks.
Zapping external EPT
--------------------
While the framework tries to be relatively generic, and to be
understandable without knowing TDX much in detail, some requirements of
TDX sometimes leak; for example the private page tables also cannot be
zapped while the range has anything mapped, so the mirrored/private page
tables need to be protected from KVM operations that zap any non-leaf
PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast().
For normal VMs, guest memory is zapped for several reasons: user
memory getting paged out by the host, memslots getting deleted,
passthrough of devices with non-coherent DMA. Confidential computing
adds to these the conversion of memory between shared and private. These
operations must not zap any private memory that is in use by the guest.
This is possible because the only zapping that is out of the control
of KVM/userspace is paging out userspace memory, which cannot apply to
guestmemfd operations. Thus a TDX VM will only zap private memory from
memslot deletion and from conversion between private and shared memory
which is triggered by the guest.
To avoid zapping too much memory, enums are introduced so that operations
can choose to target only private or shared memory, and thus only
direct or mirror EPT. For example:
Memslot deletion - Private and shared
MMU notifier based zapping - Shared only
Conversion to shared - Private only
Conversion to private - Shared only
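An illustrative sketch of the enums implied above; the type names match those
used elsewhere in this series (kvm_gfn_range_filter, kvm_tdp_mmu_root_types),
while the members and values are assumptions:

  enum kvm_gfn_range_filter {
      KVM_FILTER_SHARED   = BIT(0),
      KVM_FILTER_PRIVATE  = BIT(1),
  };

  enum kvm_tdp_mmu_root_types {
      KVM_DIRECT_ROOTS    = BIT(0),   /* normal EPT, shared memory */
      KVM_MIRROR_ROOTS    = BIT(1),   /* mirror of the private EPT */
      KVM_ALL_ROOTS       = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
  };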
Other cases of zapping will not be supported for KVM, for example
APICv update or non-coherent DMA status update; for the latter, TDX will
simply require that the CPU supports self-snoop and honor guest PAT
unconditionally for shared memory.
|
|
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
|
|
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.
This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).
To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.
Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Remove includes of linux/hyperv.h, mshyperv.h, and hyperv-tlfs.h where
they are not used.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1732577084-2122-3-git-send-email-nunodasneves@linux.microsoft.com>
|
|
Integrate hooks for mirroring page table operations for cases where TDX
will zap PTEs or free page tables.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly though calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from the guest. Now add
plumbing to propagate changes to the "external" page tables being
mirrored. Just create the x86_ops for now, leave plumbing the operations
into the TDX module for future patches.
Add two operations for tearing down page tables, one for freeing page
tables (free_external_spt) and one for zapping PTEs (remove_external_spte).
Define them such that remove_external_spte will perform a TLB flush as
well (in TDX terms, "ensure there are no active translations").
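A sketch of the two hooks as kvm_x86_ops members; the signatures are
assumptions for illustration:

  /* Free a page table page previously handed to the TDX module. */
  int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                           void *external_spt);

  /* Zap one leaf PTE in the external page tables, including the TLB flush. */
  int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                              kvm_pfn_t pfn);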
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For zapping/freeing, only
hook tdp_mmu_iter_set_spte() which is used for mapping and linking PTs.
Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for
zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and
kvm_mmu_zap_all_fast().
In previous changes to address races around concurrent populating using
tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set
FROZEN_SPTE in the mirrored page tables while performing the external
operations. Such a solution is not needed for the tear down paths in TDX
as these will always be performed with the mmu_lock held for write.
Sprinkle some KVM_BUG_ON()s to reflect this.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-16-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is a separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to propagate changes to the "external" page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.
Add two operations for setting up external page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.
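Correspondingly, a sketch of the two set-up hooks; again the signatures are
assumptions for illustration:

  /* Link a new (non-leaf) external page table page. */
  int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                           void *external_spt);

  /* Install one leaf PTE mapping a private guest page. */
  int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
                           kvm_pfn_t pfn);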
This results in MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it will make enough sense from TDX's
perspective. For developers unfamiliar with TDX arch it can express the
general concepts such that they can continue to work in the code.
TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic() which is used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle KVM_BUG_ON()s to document as code that these
paths are not supported for mirrored page tables. For zapping operations,
leave those for near future changes.
Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the external operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:
* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_external_spte() to update 4K secure EPT entry => error
because 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry
Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
operations are performed. Only write the actual mirror PTE value once the
reflect operations have completed. When trying to set a PTE to present and
encountering a frozen SPTE, retry the fault.
By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen
abandon page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
Depending on vcpu 1 state, vcpu 2 may result in EPT violation
again or make progress on guest execution
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add the ability for the TDP MMU to maintain a mirror of a separate
mapping.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, while the shared half
is managed within KVM in normal page tables.
In order to handle both shared and private memory, KVM needs to learn to
handle faults and other operations on the correct root for the operation.
KVM could learn the concept of private roots, and operate on them by
calling out to operations that call into the TDX module. But there are two
problems with that:
1. Calls into the TDX module are relatively slow compared to the simple
accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently
and it will make the code confusing when being read from their
perspective. Special operations added for TDX that set private or zap
private memory will have nothing to do with these other private memory
technologies. (SEV, etc).
To handle these, instead teach the TDP MMU about a new concept "mirror
roots". Such roots maintain page tables that are not actually mapped,
and are just used to traverse quickly to determine if the mid level page
tables need to be installed. When the memory be mirrored needs to actually
be changed, calls can be made to via x86_ops.
 private KVM page fault    |
           |               |
           V               |
      private GPA          |     CPU protected EPTP
           |               |             |
           V               |             V
    mirror PT root         |      external PT root
           |               |             |
           V               |             V
       mirror PT --hook to propagate--> external PT
           |               |             |
           \---------------+------\      |
                           |      |      |
                           |      V      V
                           |   private guest page
                           |
                           |
  non-encrypted memory     |    encrypted memory
                           |
Leave calling out to actually update the private page tables that are being
mirrored for later changes. Just implement the handling of MMU operations
on mirrored roots.
In order to direct operations to the correct root, add root types
KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct
roots to private/shared with conditionals. It could also be implemented by
making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line
up such that conversion could be a direct assignment with a cast. Don't do
this because the mapping of private to mirrored is confusing enough. So it
is worth not hiding the logic in type casting.
Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place
in kvm_mmu_free_roots(), because the private root that is being mirrored cannot be
rebuilt like a normal root. It needs to persist for the lifetime of the VM.
The TDX module will also need to be provided with page tables to use for
the actual mapping being mirrored by the mirrored page tables. Allocate
these in the mapping path using the recently added
kvm_mmu_alloc_external_spt().
Don't support 2M pages for now. This is avoided by forcing 4K pages in the
fault. Add a KVM_BUG_ON() to verify.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid
in implementing TDX shared memory.
Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly through calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.
For TDX, the shared half will be mapped in the higher alias, with a "shared
bit" set in the GPA. However, KVM will still manage it with the same
memslots as the private half. This means memslot lookups and zapping
operations will be provided with a GFN without the shared bit set.
So KVM will either need to apply or strip the shared bit before mapping or
zapping the shared EPT. Having GFNs sometimes have the shared bit and
sometimes not would make the code confusing.
So instead arrange the code such that GFNs never have shared bit set.
Create a concept of "direct bits", that is stripped from the fault
address when setting fault->gfn, and applied within the TDP MMU iterator.
Calling code will behave as if it is operating on the PTE mapping the GFN
(without shared bits) but within the iterator, the actual mappings will be
shifted using bits specific for the root. SPs will have the GFN set
without the shared bit. In the end the TDP MMU will behave like it is
mapping things at the GFN without the shared bit but with a strange page
table format where everything is offset by the shared bit.
Since TDX only needs to shift the mapping like this for the shared bit,
which is mapped as the normal TDP root, add a "gfn_direct_bits" field to
the kvm_arch structure for each VM with a default value of 0. It will
have the bit set at the position of the GPA shared bit in GFN through TD
specific initialization code. Keep TDX specific concepts out of the MMU
code by not naming it "shared".
Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN
ranges. In the convention established above, these would need to target the
shifted GFN range. It won't matter functionally, since the actual
implementation will always result in a full flush for the only planned
user (TDX). For correctness reasons, future changes can provide a TDX
x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and
force the full flush for TDs.
This leaves one problem. Some operations use a concept of max GFN (i.e.
kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the
direct mask to the start of the range, the iterator would end up skipping
iterating over the range not covered by the direct mask bit. For safety,
make sure the __tdp_mmu_zap_root() operation iterates over the full GFN
range supported by the underlying TDP format. Add a new iterator helper,
for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range,
regardless of root.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce a "is_mirror" member to the kvm_mmu_page_role union to identify
SPTEs associated with the mirrored EPT.
The TDX module maintains the private half of the EPT mapped in the TD in
its protected memory. KVM keeps a copy of the private GPAs in a mirrored
EPT tree within host memory. This "is_mirror" attribute enables vCPUs to
find and get the root page of mirrored EPT from the MMU root list for a
guest TD. This also allows KVM MMU code to detect changes in mirrored EPT
according to the "is_mirror" mmu page role and propagate the changes to
the private EPT managed by TDX module.
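Pictured as a sketch (the kvm_mmu_page_role layout is abbreviated, and the
accessor name is illustrative):
  /* One new bit in the kvm_mmu_page_role bitfield union ... */
  unsigned is_mirror:1;

  /* ... and an accessor so MMU code can tell mirror SPs apart, e.g. when
   * selecting a root from the MMU root list or deciding whether a change
   * must be propagated to the TDX-managed EPT. */
  static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
  {
          return sp->role.is_mirror;
  }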
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-6-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add an external pointer to struct kvm_mmu_page for TDX's private page table
and add helper functions to allocate/initialize/free a private page table
page. TDX will only be supported with the TDP MMU. Because the KVM TDP MMU
doesn't use unsync_children and write_flooding_count, pack them into a
union with the new pointer to avoid memory overhead.
For private GPAs, the CPU refers to a private page table whose contents are
encrypted. Dedicated APIs must be used to operate on it (e.g. to update or
read its PTEs), and they are expensive.
When KVM resolves a page fault, it walks the page tables. To reuse the
existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate two sets of page tables for the private
half of the GPA space.
For the page tables that KVM will walk, allocate them like normal and refer
to them as mirror page tables. Additionally, allocate one more page for the
page tables the CPU will walk, and call them external page tables. Resolve
the KVM page fault with the existing code, and do the additional operations
necessary for modifying the external page table in future patches.
The relationship of the types of page tables in this scheme is depicted
below:
              KVM page fault                 |
                     |                       |
                     v                       |
        -------------+----------             |
        |                      |             |
        v                      v             |
     shared GPA           private GPA        |
        |                      |             |
        v                      v             |
  shared PT root        mirror PT root       |     private PT root
        |                      |             |            |
        v                      v             |            v
    shared PT              mirror PT --propagate-->  external PT
        |                      |             |            |
        |                      \-------------+-------\    |
        |                                    |       |    |
        v                                    |       v    v
  shared guest page                          |  private guest page
                                             |
  non-encrypted memory                       |    encrypted memory
                                             |
PT          - Page table
Shared PT   - Visible to KVM, and the CPU uses it for shared mappings.
External PT - The CPU uses it, but it is invisible to KVM. TDX module
              updates this table to map private guest pages.
Mirror PT   - It is visible to KVM, but the CPU doesn't use it. KVM uses
              it to propagate PT change to the actual private PT.
Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it
to the TDX vm type.
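A sketch of the resulting layout and hook is below; field types and
placement are abbreviated and may differ from the final code:
  /* In struct kvm_mmu_page: pack the shadow-MMU-only fields into a
   * union with the pointer to the external page table page. */
  union {
          struct {
                  unsigned int unsync_children;
                  atomic_t write_flooding_count;
          };
          /* TDP MMU only: page table walked by the CPU for private GPAs,
           * handed to the TDX module and never dereferenced by KVM. */
          void *external_spt;
  };

  /* Only VM types with TDX-style mirrored TDP pay for the extra page. */
  static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
  {
          return kvm->arch.vm_type == KVM_X86_TDX_VM;
  }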
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the declarations for the hypercall emulation APIs to x86.h. While the
helpers are exported, they are intended to be consumed only by KVM vendor
modules, i.e. don't need to be exposed to the kernel at-large.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241128004344.4072099-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove the redundant .hwapic_irr_update() ops.
If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1
in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned
with the vIRR in the virtual APIC. So, no need to update RVI every time
the vIRR changes.
Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or
vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no
impact on the nested case.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add VMX/SVM specific interrupt injection info to the kvm_entry tracepoint.
As is done with kvm_exit, gather the information via a kvm_x86_ops hook
to avoid the moderately costly VMREADs on VMX when the tracepoint isn't
enabled.
Opportunistically rename the parameters in the get_exit_info()
declaration to match the names used by both SVM and VMX.
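The gist, sketched below with illustrative hook and variable names (the
actual patch may do the gating via the tracepoint machinery itself), is
to fetch the injection fields only when the tracepoint fires:
  u32 intr_info = 0, error_code = 0;

  /* Only pay for the fetch (VMREADs on VMX) when kvm_entry is traced. */
  if (trace_kvm_entry_enabled())
          kvm_x86_call(get_entry_info)(vcpu, &intr_info, &error_code);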
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com
[sean: drop is_guest_mode() change, use intr_info/error_code for names]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move handling of emulation during event vectoring, which KVM doesn't
support, into VMX's check_emulate_instruction(), so that KVM detects
all unsupported emulation, not just cached emulated MMIO (EPT misconfig),
e.g. emulated MMIO that isn't cached (EPT Violation) or emulation that
occurs with legacy shadow paging (#PF).
Rejecting emulation on other sources of emulation also fixes a largely
theoretical flaw (thanks to the "unprotect and retry" logic), where KVM
could incorrectly inject a #DF:
1. CPU executes an instruction and hits a #GP
2. While vectoring the #GP, a shadow #PF occurs
3. On the #PF VM-Exit, KVM re-injects #GP
4. KVM emulates because of the write-protected page
5. KVM "successfully" emulates and also detects the #GP
6. KVM synthesizes a #GP, and since #GP has already been injected,
incorrectly escalates to a #DF.
Fix the comment about EMULTYPE_PF, as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to a write-protection
violation.
Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before
attempting emulation of any kind.
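Conceptually, the added check amounts to something like the sketch below;
the exact condition and the carve-outs (e.g. for write-protection #PFs
that unprotect-and-retry can fix) are omitted:
  /* In vmx_check_emulate_instruction(): refuse to emulate if the
   * VM-Exit occurred while the CPU was vectoring an event. */
  if (to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK)
          return X86EMUL_UNHANDLEABLE;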
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract the VMX code for an unhandleable VM-Exit during vectoring into a
vendor-agnostic function so that boilerplate code can be shared by SVM.
To avoid unnecessary complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry. For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps. Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.
Tracking *all* guest capabilities that KVM cares about will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.
The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit were eliminating
the "governed features" terminology. And practically speaking, the real
cost is zero unless those 92 bytes push the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.
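For reference, the shape of the new tracking is roughly as sketched below
(abbreviated; reverse-CPUID sanity checks and the cap-update helpers are
omitted):
  /* Per-vCPU guest view of kvm_cpu_caps: one u32 per tracked CPUID
   * word, i.e. ~25 words => ~100 bytes, vs. the old 8-byte bitmap. */
  u32 cpu_caps[NR_KVM_CPU_CAPS];

  static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
                                                unsigned int x86_feature)
  {
          unsigned int x86_leaf = __feature_leaf(x86_feature);

          return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
  }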
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.
Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU. And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.
Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX
can defer the update until after nested VM-Exit if an EOI for L1's vAPIC
occurs while L2 is active.
Note, commit d39850f57d21 ("KVM: x86: Drop @vcpu parameter from
kvm_x86_ops.hwapic_isr_update()") removed the parameter with the
justification that doing so "allows for a decent amount of (future)
cleanup in the APIC code", but it's not at all clear what cleanup was
intended, or if it was ever realized.
No functional change intended.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
kvm_vm_create_worker_thread() is meant to be used for kthreads that
can consume significant amounts of CPU time on behalf of a VM or in
response to how the VM behaves (for example how it accesses its memory).
Therefore it wants to charge the CPU time consumed by that work to
the VM's container.
However, because of these threads, cgroups which have kvm instances
inside never complete freezing. This can be trivially reproduced:
root@test ~# mkdir /sys/fs/cgroup/test
root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
root@test ~# qemu-system-x86_64 -nographic -enable-kvm
and in another terminal:
root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
root@test ~# cat /sys/fs/cgroup/test/cgroup.events
populated 1
frozen 0
The cgroup freezing happens in the signal delivery path but
kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never
calls into the signal delivery path and thus never gets frozen. Because
the cgroup freezer determines whether a given cgroup is frozen by
comparing the number of frozen threads to the total number of threads
in the cgroup, the cgroup never becomes frozen and users waiting for
the state transition may hang indefinitely.
Since the worker kthread is tied to a user process, it's better if
it behaves as similarly to user tasks as possible, including being
subject to SIGSTOP and SIGCONT. In fact, vhost_task is all that
kvm_vm_create_worker_thread() wanted to be and more: not only does it
inherit the userspace process's cgroups, it has other niceties like
being parented properly in the process tree. Use it instead of the
homegrown alternative.
Incidentally, the new code is also better behaved when recovery is flipped
from enabled to disabled and back to enabled. If the recovery period
is 1 minute, it will run the next recovery after 1 minute regardless
of how many times the parameter was flipped.
(Commit message based on emails from Tejun).
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 misc changes for 6.13
- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
- Quirk KVM's misguided behavior of initializing certain feature MSRs to
their maximum supported feature set, which can result in KVM creating
invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
value results in the vCPU having invalid state if userspace hides PDCM
from the guest, which can lead to save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
to better follow the "architecture", in quotes because the actual
behavior is poorly documented. E.g. most MSR writes and descriptor
table loads ignore CR4.LA57 and operate purely on whether the CPU
supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(), as
filling the cache from IRQ context is generally unsafe, and harden the
cache accessors to try to prevent similar issues from occurring in the
future.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
- Minor cleanups
|
|
Drop the per-VM zapped_obsolete_pages list now that the usage from the
defunct mmu_shrinker is gone, and instead use a local list to track pages
in kvm_zap_obsolete_pages(), the sole remaining user of
zapped_obsolete_pages.
Opportunistically add an assertion to verify and document that slots_lock
must be held, i.e. that there can only be one active instance of
kvm_zap_obsolete_pages() at any given time, and by doing so also prove
that using a local list instead of a per-VM list doesn't change any
functionality (beyond trivialities like list initialization).
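In outline, the shape of the change inside kvm_zap_obsolete_pages() is (a
sketch):
  LIST_HEAD(invalid_list);   /* replaces kvm->arch.zapped_obsolete_pages */

  /* Only one zapper at a time: make the implicit rule explicit. */
  lockdep_assert_held(&kvm->slots_lock);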
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.
Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).
Recovering huge pages reduces the number of page faults when dirty
logging is disabled:
$ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g
Before: 393,599 kvm:kvm_page_fault
After: 262,575 kvm:kvm_page_fault
vCPU throughput and the latency of disabling dirty logging are about
the same as with zapping, but avoiding faults can be beneficial for
removing vCPU jitter in extreme scenarios.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore. E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.
Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When querying guest CPL to determine if a vCPU was preempted while in
kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from
the VMCS on Intel CPUs. If the kernel is running with full preemption
enabled, using the register cache in the preemption path can result in
stale and/or uninitialized data being cached in the segment cache.
In particular the following scenario is currently possible:
- vCPU is just created, and the vCPU thread is preempted before
SS.AR_BYTES is written in vmx_vcpu_reset().
- When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() =>
vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.
- vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't
invoke vmx_segment_cache_clear() to invalidate the cache.
As a result, KVM retains a stale value in the cache, which can be read,
e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX
segment cache is reset on each VM-Exit, but if the userspace VMM (e.g. KVM
selftests) reads and writes system registers just after the vCPU was
created, _without_ modifying SS.AR_BYTES, userspace will write back the
stale '0' value and ultimately will trigger a VM-Entry failure due to
incorrect SS segment type.
Note, the VM-Enter failure can also be avoided by moving the call to
vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all
segments. However, while that change is correct and desirable (and will
come along shortly), it does not address the underlying problem that
accessing KVM's register caches from !task context is generally unsafe.
In addition to fixing the immediate bug, bypassing the cache for this
particular case will allow hardening KVM's register caching logic to
assert that the caches are accessed only when KVM _knows_ it is safe to
do so.
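For Intel, the bypass boils down to something like the sketch below
(real-mode and other special cases omitted):
  /* Read SS.AR_BYTES straight from the VMCS, bypassing the segment
   * cache, so a preemption at the wrong time can't latch stale or
   * uninitialized state. */
  static int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu)
  {
          return VMX_AR_DPL(vmcs_read32(GUEST_SS_AR_BYTES));
  }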
Fixes: de63ad4cf497 ("KVM: X86: implement the logic for spinlock optimization")
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM VMX changes for 6.12:
- Set FINAL/PAGE in the page fault error code for EPT Violations if and only
if the GVA is valid. If the GVA is NOT valid, there is no guest-side page
table walk and so stuffing paging related metadata is nonsensical.
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of
emulating posted interrupt delivery to L2.
- Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.
- Harden eVMCS loading against an impossible NULL pointer deref (really truly
should be impossible).
- Minor SGX fix and a cleanup.
|
|
KVM x86 MMU changes for 6.12:
- Overhaul the "unprotect and retry" logic to more precisely identify cases
where retrying is actually helpful, and to harden all retry paths against
putting the guest into an infinite retry loop.
- Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in
the shadow MMU.
- Refactor pieces of the shadow MMU related to aging SPTEs in preparation for
adding MGLRU support in KVM.
- Misc cleanups
|
|
KVM x86 misc changes for 6.12
- Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10
functionality that is on the horizon).
- Rework common MSR handling code to suppress errors on userspace accesses to
unsupported-but-advertised MSRs. This will allow removing (almost?) all of
KVM's exemptions for userspace access to MSRs that shouldn't exist based on
the vCPU model (the actual cleanup is non-trivial future work).
- Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the
64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv)
stores the entire 64-bit value at the ICR offset.
- Fix a bug where KVM would fail to exit to userspace if the exit was
triggered by a fastpath exit handler.
- Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when
there's already a pending wake event at the time of the exit.
- Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of
guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is
supposed to hit L1, not L2).
|