path: root/arch/x86/include/asm/kvm_host.h
2025-03-14  KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT  (Yan Zhao)
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have KVM ignore guest PAT when this quirk is enabled. On AMD platforms, KVM always honors guest PAT. On Intel, however, there are two issues. First, KVM *cannot* honor guest PAT if the CPU feature self-snoop is not supported. Second, UC access on certain Intel platforms can be very slow[1] and honoring guest PAT on those platforms may break some old guests that accidentally specify video RAM as UC. Those old guests may not expect the slowness since KVM always forced WB previously. See [2]. So, introduce a quirk that KVM can enable by default on all Intel platforms to avoid breaking old unmodifiable guests. Newer userspace can disable this quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey the wish. The quirk is a no-op on AMD and also if any assigned devices have non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is another example of a quirk that is sometimes automatically disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1] Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2] Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com> [Use supported_quirks/inapplicable_quirks to support both AMD and no-self-snoop cases, as well as to remove the shadow_memtype_mask check from kvm_mmu_may_ignore_guest_pat(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
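As a rough illustration of the userspace side, a minimal sketch of disabling the quirk through the existing KVM_CAP_DISABLE_QUIRKS2 API (the helper name and error handling here are illustrative, not part of the patch):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Ask KVM to honor guest PAT for this VM; KVM rejects the request,
     * e.g. when self-snoop is unsupported, if the quirk bit was not
     * reported by KVM_CHECK_EXTENSION(KVM_CAP_DISABLE_QUIRKS2). */
    static int disable_ignore_guest_pat(int vm_fd)
    {
            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_DISABLE_QUIRKS2,
                    .args[0] = KVM_X86_QUIRK_IGNORE_GUEST_PAT,
            };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }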
2025-03-14  KVM: x86: Allow vendor code to disable quirks  (Paolo Bonzini)
In some cases, the handling of quirks is split between platform-specific code and generic code, or it is done entirely in generic code, but the relevant bug does not trigger on some platforms; for example, this will be the case for "ignore guest PAT". Allow unaffected vendor modules to disable handling of a quirk for all VMs via a new entry in kvm_caps. Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API tells userspace that KVM *knows* that some of its past behavior was bogus or just undesirable. In other words, it's plausible for userspace to refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so preserve that and make it part of the API. As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on Intel systems. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: TDX: Add support for find pending IRQ in a protected local APIC  (Sean Christopherson)
Add flag and hook to KVM's local APIC management to support determining whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual APIC page is owned by the TDX module and cannot be accessed by KVM. As a result, registers that are virtualized by the CPU, e.g. PPR, cannot be read or written by KVM. To deliver interrupts for TDX guests, KVM must send an IRQ to the CPU on the posted interrupt notification vector. And to determine if a TDX vCPU has a pending interrupt, KVM must check if there is an outstanding notification. Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is protected to short-circuit the various other flows that try to pull an IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is pending, as KVM can't do anything based on _which_ IRQ is pending. Intentionally omit sanity checks from other flows, e.g. PPR update, so as not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM and userspace will never reach those flows for TDX guests, but reaching them is not fatal if something does go awry. For TD exits not due to a HLT TDCALL, skip checking RVI pending in tdx_protected_apic_has_interrupt(). Except for the guest being stupid (e.g., a non-HLT TDCALL in an interrupt shadow), it's not even possible to have an interrupt in RVI that is fully unmasked. There are no CPU flows that modify RVI in the middle of instruction execution. I.e. if RVI is non-zero, then either the interrupt has been pending since before the TD exit, or the instruction that caused the TD exit is in an STI/SS shadow. KVM doesn't care about STI/SS shadows outside of the HALTED case. And if the interrupt was pending before TD exit, then it _must_ be blocked, otherwise the interrupt would have been serviced at the instruction boundary. The HLT TDCALL case will be handled in a future patch when HLT TDCALL is supported. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
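A minimal sketch of the short-circuit in kvm_apic_has_interrupt(), assuming the flag and hook names implied by the changelog (guest_apic_protected, protected_apic_has_interrupt):

    int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
    {
            struct kvm_lapic *apic = vcpu->arch.apic;
            u32 ppr;

            if (!kvm_apic_present(vcpu))
                    return -1;

            /* The vAPIC page is owned by the TDX module: KVM can only ask
             * the vendor module *if* an IRQ is pending (via the posted
             * interrupt notification), never *which* IRQ it is. */
            if (apic->guest_apic_protected)
                    return kvm_x86_call(protected_apic_has_interrupt)(vcpu);

            __apic_update_ppr(apic, &ppr);
            return apic_has_interrupt_for_ppr(apic, ppr);
    }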
2025-03-14  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior  (Isaku Yamahata)
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest DRs. TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter, and resets DRs to architectural INIT state on TD exit. Use the new flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to save/restore guest DRs. KVM still needs to restore host DRs after TD exit if there are active breakpoints in the host, which is covered by the existing code. MOV-DR exiting is always cleared for TDX guests, so the handler for DR access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH are set. Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT(). Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
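A sketch of the converted definitions plus the new flag (the exact bit positions are an assumption):

    #define KVM_DEBUGREG_BP_ENABLED    BIT(0)
    #define KVM_DEBUGREG_WONT_EXIT     BIT(1)
    /* Guest DRs are saved/restored by hardware (the TDX module) on TD
     * exit/enter; KVM must not context switch them itself. */
    #define KVM_DEBUGREG_AUTO_SWITCH   BIT(2)

    /* MOV-DR exiting is always cleared for TDX guests, so seeing both
     * flags at once would be a KVM bug: */
    WARN_ON_ONCE((vcpu->arch.switch_db_regs &
                  (KVM_DEBUGREG_WONT_EXIT | KVM_DEBUGREG_AUTO_SWITCH)) ==
                 (KVM_DEBUGREG_WONT_EXIT | KVM_DEBUGREG_AUTO_SWITCH));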
2025-03-14  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr  (Chao Gao)
Several MSRs are constant and only used in userspace (ring 3). But VMs may have different values. KVM uses kvm_set_user_return_msr() to switch to the guest's values and leverages the user return notifier to restore them when the kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also caches the value it wrote to an MSR last time. The TDX module unconditionally resets some of these MSRs to architectural INIT state on TD exit. This makes the cached values in kvm_user_return_msrs inconsistent with the values in hardware. This inconsistency needs to be fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring some MSRs to the host's values. kvm_set_user_return_msr() can help correct this case, but it is not optimal as it always does a wrmsr. So, introduce a variation of kvm_set_user_return_msr() to update cached values and skip that wrmsr. Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
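A sketch of the cache-only variant, assuming it mirrors kvm_set_user_return_msr() minus the wrmsr (the internals here are assumptions):

    void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
    {
            struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);

            /* The TDX module already wrote the MSR on TD exit; record the
             * new hardware value so that kvm_on_user_return() correctly
             * restores the host value instead of skipping the wrmsr. */
            msrs->values[slot].curr = value;
            kvm_user_return_register_notifier(msrs);
    }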
2025-03-14  KVM: x86: Make cpu_dirty_log_size a per-VM value  (Yan Zhao)
Make cpu_dirty_log_size (the CPU's dirty log buffer size) a per-VM value and set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled. Do not set it for TDs. Until now, cpu_dirty_log_size was a system-wide value that was used for all VMs and was set to the PML buffer size when PML was enabled in VMX. However, PML is not currently supported for TDs, though PML remains available for normal VMs as long as the feature is supported by hardware and enabled in VMX. Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer size for normal VMs and 0 for TDs. This allows functions like kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to determine if PML is supported, in order to kick off vCPUs or request them to update CPU dirty logging status (turn on/off PML in VMCS). This fixes an issue first reported in [1], where QEMU attaches an emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES still works if the corresponding memslot has no KVM_MEM_GUEST_MEMFD flag. KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx struct for a TDX VM. Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com> Reported-by: Pedro Principeza <pedro.principeza@canonical.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://github.com/canonical/tdx/issues/202 Link: https://github.com/canonical/tdx/issues/202 [1] Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
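A sketch of the per-VM plumbing, assuming the PML_LOG_NR_ENTRIES name from the PML refactor and that TDs take a separate init path that leaves the field 0:

    /* VMX path, reached only for normal VMs: */
    static void vmx_vm_init_dirty_log(struct kvm *kvm)
    {
            if (enable_pml)
                    kvm->arch.cpu_dirty_log_size = PML_LOG_NR_ENTRIES;
    }

    /* Generic code then keys off the per-VM value instead of a global: */
    static bool kvm_cpu_dirty_log_enabled(struct kvm *kvm)
    {
            return !!kvm->arch.cpu_dirty_log_size;
    }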
2025-03-14  KVM: TDX: Do TDX specific vcpu initialization  (Isaku Yamahata)
TD guest vcpu needs TDX specific initialization before running. Repurpose KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command KVM_TDX_INIT_VCPU, and implement the callback for it. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/ (Sean) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: TDX: create/destroy VM structure  (Isaku Yamahata)
Implement managing the TDX private KeyID as part of creating, destroying, and freeing a TDX guest. When creating a TDX guest, assign a TDX private KeyID to the TDX guest for memory encryption, and allocate pages for the guest. These are used for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS). On destruction, free the allocated pages, and the KeyID. Before tearing down the private page tables, TDX requires the guest TD to be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops hook. The TDR control structures can be freed in the vm_destroy() hook, which runs last. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix build issue in kvm-coco-queue - Init ret earlier to fix __tdx_td_init() error handling. (Chao) - Standardize -EAGAIN for __tdx_td_init() retry errors (Rick) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: x86: Add infrastructure for secure TSC  (Isaku Yamahata)
Add a guest_tsc_protected member to struct kvm_arch_vcpu and prohibit changing the TSC offset/multiplier when guest_tsc_protected is true. X86 confidential computing technology defines protected guest TSC so that the VMM can't change the TSC offset/multiplier once the vCPU is initialized. SEV-SNP defines Secure TSC as optional, whereas TDX mandates it. KVM has common logic on x86 that tries to guess or adjust the TSC offset/multiplier for better guest TSC and TSC interrupt latency at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes (kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation conflicts with protected TSC because the VMM can't change the TSC offset/multiplier. Because KVM emulates the TSC timer or the TSC deadline timer with the TSC offset/multiplier, the TSC timer interrupt is injected into the guest at the wrong time if the KVM TSC offset is different from what the TDX module determined. Originally this issue was found by the cyclic test of rt-tests [1], as the latency in the TDX case is worse than the VMX value + TDX SEAMCALL overhead. It turned out that the KVM TSC offset is different from what the TDX module determines. Disable or ignore the KVM logic that changes/adjusts the TSC offset/multiplier, thus keeping the KVM TSC offset/multiplier the same as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a change in the TSC offset. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
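A simplified sketch of the guard, assuming all offset writes funnel through one helper (the real patch touches the multiplier and MSR paths as well):

    static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
    {
            /* The TDX module owns the offset; turning the common tuning
             * paths (postcreate, vcpu_load, TSC MSR writes) into no-ops
             * keeps KVM's view in sync with what the guest observes. */
            if (vcpu->arch.guest_tsc_protected)
                    return;

            vcpu->arch.l1_tsc_offset = l1_offset;
            kvm_x86_call(write_tsc_offset)(vcpu);
    }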
2025-02-28  KVM: x86: Snapshot the host's DEBUGCTL in common x86  (Sean Christopherson)
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25  KVM: x86: Use a dedicated flow for queueing re-injected exceptions  (Sean Christopherson)
Open code the filling of vcpu->arch.exception in kvm_requeue_exception() instead of bouncing through kvm_multiple_exception(), as re-injection doesn't actually share that much code with "normal" injection, e.g. the VM-Exit interception check, payload delivery, and nested exception code is all bypassed as those flows only apply during initial injection. When FRED comes along, the special casing will only get worse, as FRED explicitly tracks nested exceptions and essentially delivers the payload on the stack frame, i.e. re-injection will need more inputs, and normal injection will have yet more code that needs to be bypassed when KVM is re-injecting an exception. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25  KVM: x86: Rename and invert async #PF's send_user_only flag to send_always  (Sean Christopherson)
Rename send_user_only to avoid "user", because KVM's ABI is to not inject page faults into CPL0, whereas "user" in x86 is specifically CPL3. Invert the polarity to keep the naming simple and unambiguous. E.g. while KVM often refers to CPL0 as "kernel", that terminology isn't ubiquitous, and "send_kernel" could be misconstrued as "send only to kernel". Link: https://lore.kernel.org/r/20250215010609.1199982-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24  KVM: x86/xen: Move kvm_xen_hvm_config field into kvm_xen  (Sean Christopherson)
Now that all KVM usage of the Xen HVM config information is buried behind CONFIG_KVM_XEN=y, move the per-VM kvm_xen_hvm_config field out of kvm_arch and into kvm_xen. No functional change intended. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-24  KVM: x86/xen: Bury xen_hvm_config behind CONFIG_KVM_XEN=y  (Sean Christopherson)
Now that all references to kvm_vcpu_arch.xen_hvm_config are wrapped with CONFIG_KVM_XEN #ifdefs, bury the field itself behind CONFIG_KVM_XEN=y. No functional change intended. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250215011437.1203084-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-14  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock  (Sean Christopherson)
Steal another bit from rmap entries (which are word aligned pointers, i.e. have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the bit to implement a *very* rudimentary per-rmap spinlock. The only anticipated usage of the lock outside of mmu_lock is for aging gfns, and collisions between aging and other MMU rmap operations are quite rare, e.g. unless userspace is being silly and aging a tiny range over and over in a tight loop, time between contention when aging an actively running VM is O(seconds). In short, a more sophisticated locking scheme shouldn't be necessary. Note, the lock only protects the rmap structure itself; SPTEs that are pointed at by a locked rmap can still be modified and zapped by another task (KVM drops/zaps SPTEs before deleting the rmap entries). Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
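A sketch of the bit-spinlock, assuming the lock steals bit 1 of the word-aligned rmap value and eliding some memory-ordering details:

    #define KVM_RMAP_LOCKED BIT(1)

    static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
    {
            unsigned long old_val, new_val;

            do {
                    /* Spin until the lock bit is observed clear, then set
                     * it atomically; contention is rare enough that no
                     * backoff or queueing is needed. */
                    old_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
                    new_val = old_val | KVM_RMAP_LOCKED;
            } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

            return old_val;
    }

    static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
                                unsigned long new_val)
    {
            WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
            /* Release the lock and publish the possibly-updated rmap. */
            smp_store_release(&rmap_head->val, new_val);
    }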
2025-02-14  KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lock  (Sean Christopherson)
Walk the TDP MMU in an RCU read-side critical section without holding mmu_lock when harvesting and potentially updating age information on TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots, and do all SPTE aging with atomic updates; while clobbering Accessed information is ok, KVM must not corrupt other bits, e.g. must not drop a Dirty or Writable bit when making a SPTE young. If updating a SPTE to mark it for access tracking fails, leave it as is and treat it as if it were young. If the SPTE is being actively modified, it is most likely young. Acquire and release mmu_lock for write when harvesting age information from the shadow MMU, as the shadow MMU doesn't yet support aging outside of mmu_lock. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
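For the A/D-enabled case, the aging boils down to an atomic AND-NOT so that a racing Dirty/Writable update can't be lost; a sketch with assumed helper names:

    static bool kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
    {
            u64 old_spte;

            /* Clear only the Accessed bit; a plain read-modify-write could
             * drop a Dirty bit set by hardware in parallel. */
            old_spte = atomic64_fetch_andnot(shadow_accessed_mask,
                                (atomic64_t *)rcu_dereference(iter->sptep));

            return old_spte & shadow_accessed_mask;  /* was the gfn young? */
    }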
2025-02-12  KVM: x86: Remove per-vCPU "cache" of its reference pvclock  (Sean Christopherson)
Remove the per-vCPU "cache" of the reference pvclock and instead cache only the TSC shift+multiplier. All other fields in pvclock are fully recomputed by kvm_guest_time_update(), i.e. aren't actually persisted. In addition to shaving a few bytes, explicitly tracking the TSC shift/mul fields makes it easier to see that those fields are tied to hw_tsc_khz (they exist to avoid having to do expensive math in the common case). And conversely, not tracking the other fields makes it easier to see that things like the version number are pulled from the guest's copy, not from KVM's reference. Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20250201013827.680235-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12  KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation  (Sean Christopherson)
Defer runtime CPUID updates until the next non-faulting CPUID emulation or KVM_GET_CPUID2, which are the only paths in KVM that consume the dynamic entries. Deferring the updates is especially beneficial to nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state changes, not to mention the updates don't need to be realized while L2 is active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept on Intel, but not AMD). Deferring CPUID updates shaves several hundred cycles from nested VMX roundtrips, as measured from L2 executing CPUID in a tight loop:

  SKX 6850 => 6450
  ICX 9000 => 8800
  EMR 7900 => 7700

Alternatively, KVM could update only the CPUID leaves that are affected by the state change, e.g. update XSAVE info only if XCR0 or XSS changes, but that adds non-trivial complexity and doesn't solve the underlying problem of nested transitions potentially changing both XCR0 and XSS, on both nested VM-Enter and VM-Exit. Skipping updates entirely if L2 is active and CPUID is being intercepted by L1 could work for the common case. However, simply skipping updates if L2 is active is *very* subtly dangerous and complex. Most KVM updates are triggered by changes to the current vCPU state, which may be L2 state, whereas performing updates only for L1 would require detecting changes to L1 state. KVM would need to either track relevant L1 state, or defer runtime CPUID updates until the next nested VM-Exit. The former is ugly and complex, while the latter comes with similar dangers to deferring all CPUID updates, and would only address the nested VM-Enter path. To guard against using stale data, disallow querying dynamic CPUID feature bits, i.e. features that KVM updates at runtime, via a compile-time assertion in guest_cpu_cap_has(). Exempt MWAIT from the rule, as the MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic CPUID feature. Note, the rule could be enforced for MWAIT as well, e.g. by querying guest CPUID in kvm_emulate_monitor_mwait(), but there's no obvious advantage to doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can of worms. MONITOR/MWAIT can't be virtualized (for a reasonable definition), and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is wrong for other reasons. Beyond the aforementioned feature bits, the only other dynamic CPUID (sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those CPUID entries in KVM is all but guaranteed to be a bug. The layout for an actual XSAVE buffer depends on the format (compacted or not) and potentially the features that are actually enabled. E.g. see the logic in fpstate_clear_xstate_component() needed to poke into the guest's effective XSAVE state to clear MPX state on INIT. KVM does consume CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(), but not EBX, which is the only dynamic output register in the leaf. Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
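A sketch of the deferral itself; cpuid_dynamic_bits_dirty is the assumed per-vCPU flag, checked only on the two consuming paths:

    /* Called wherever guest state that feeds dynamic CPUID bits changes
     * (CR4.OSXSAVE, XCR0/XSS, MISC_ENABLE, etc.): */
    static void kvm_cpuid_mark_dynamic_dirty(struct kvm_vcpu *vcpu)
    {
            vcpu->arch.cpuid_dynamic_bits_dirty = true;
    }

    /* Called from kvm_emulate_cpuid() and KVM_GET_CPUID2, the only paths
     * that consume the dynamic entries: */
    static void kvm_cpuid_realize_dynamic_bits(struct kvm_vcpu *vcpu)
    {
            if (vcpu->arch.cpuid_dynamic_bits_dirty) {
                    vcpu->arch.cpuid_dynamic_bits_dirty = false;
                    kvm_update_cpuid_runtime(vcpu);
            }
    }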
2025-02-12  KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop  (Sean Christopherson)
Move the conditional loading of hardware DR6 with the guest's DR6 value out of the core .vcpu_run() loop to fix a bug where KVM can load hardware with a stale vcpu->arch.dr6. When the guest accesses a DR and host userspace isn't debugging the guest, KVM disables DR interception and loads the guest's values into hardware on VM-Enter and saves them on VM-Exit. This allows the guest to access DRs at will, e.g. so that a sequence of DR accesses to configure a breakpoint only generates one VM-Exit. For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop. But for DR6, the guest's value doesn't need to be loaded into hardware for KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas VMX requires software to manually load the guest value, and so loading the guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done _inside_ the core run loop. Unfortunately, saving the guest values on VM-Exit is initiated by common x86, again outside of the core run loop. If the guest modifies DR6 (in hardware, when DR interception is disabled), and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and clobber the guest's actual value. The bug shows up primarily with nested VMX because KVM handles the VMX preemption timer in the fastpath, and the window between hardware DR6 being modified (in guest context) and DR6 being read by guest software is orders of magnitude larger in a nested setup. E.g. in non-nested, the VMX preemption timer would need to fire precisely between #DB injection and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the window where hardware DR6 is "dirty" extends all the way from L1 writing DR6 to VMRESUME (in L1).

L1's view:
==========
<L1 disables DR interception>
        CPU 0/KVM-7289  [023] d....  2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
        CPU 0/KVM-7289  [023] d....  2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B:
        CPU 0/KVM-7289  [023] d....  2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
        CPU 0/KVM-7289  [023] d....  2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
        CPU 0/KVM-7289  [023] d....  2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
        CPU 0/KVM-7289  [023] d....  2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
        CPU 0/KVM-7289  [023] d....  2925.640983: kvm_entry: vcpu 0
        CPU 0/KVM-7289  [023] d....  2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
        CPU 0/KVM-7289  [023] d....  2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
        CPU 0/KVM-7289  [023] d....  2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0

L0's view:
==========
L2 reads DR6, arch.dr6 = 0
        CPU 23/KVM-5046 [001] d....  3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
        CPU 23/KVM-5046 [001] .....  3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
        CPU 23/KVM-5046 [001] .....  3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
        CPU 23/KVM-5046 [001] d....  3410.005610: kvm_entry: vcpu 23
        CPU 23/KVM-5046 [001] d....  3410.005611: kvm_exit: vcpu 23 reason VMREAD
        CPU 23/KVM-5046 [001] d....  3410.005611: kvm_entry: vcpu 23
        CPU 23/KVM-5046 [001] d....  3410.005612: kvm_exit: vcpu 23 reason VMREAD
        CPU 23/KVM-5046 [001] d....  3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
        CPU 23/KVM-5046 [001] d....  3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
        CPU 23/KVM-5046 [001] d....  3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
        CPU 23/KVM-5046 [001] d....  3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B:
        CPU 23/KVM-5046 [001] d....  3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
        CPU 23/KVM-5046 [001] d....  3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
        CPU 23/KVM-5046 [001] d....  3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
        CPU 23/KVM-5046 [001] d....  3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0

Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11  KVM: x86: Remove unused iommu_domain and iommu_noncoherent from kvm_arch  (Ted Chen)
Remove the "iommu_domain" and "iommu_noncoherent" fields from struct kvm_arch, which are no longer used since commit ad6260da1e23 ("KVM: x86: drop legacy device assignment"). Signed-off-by: Ted Chen <znscnchen@gmail.com> Link: https://lore.kernel.org/r/20250124075055.97158-1-znscnchen@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-25  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a direct call to kvm_tdp_page_fault() when RETPOLINE is enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig, even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots. When TDX is enabled, operations on private pages will need to go through the privileged TDX Module via SEAMCALLs; as a result, they are limited and relatively slow compared to reading a PTE. The patches included in 6.14 allow KVM to keep a mirror of the private EPT in host memory, and define entries in kvm_x86_ops to operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation kthread to vhost_task moved the task under the main process. The task is created as soon as KVM_CREATE_VM is invoked and this, of course, broke userspace that didn't expect to see any child task of the VM process until it started creating its own userspace threads. In particular crosvm refuses to fork() if procfs shows any child task, so unbreak it by creating the task lazily. This is arguably a userspace bug, as there can be other kinds of legitimate worker tasks and they wouldn't impede fork(); but it's not like userspace has a way to distinguish kernel worker tasks right now. Should they show as "Kthread: 1" in proc/.../status?
  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit while L2 is active, which ultimately results in a hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery during nested VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to reap entries in the same order they were created, i.e. to mark gfns dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when setting memory regions and add a dedicated API for setting KVM-internal memory regions. The API can then explicitly disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug where KVM would return a pointer to a vCPU prior to it being fully online, and give kvm_for_each_vcpu() similar treatment to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a bug where userspace could coerce KVM into handling the ioctl on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such failures are impossible in practice after xa_reserve(), reserving an entry is always followed by xa_store() which does not know (or differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of them require anything in KVM except for detecting them and marking them as supported; Zabha adds byte and halfword atomic operations, while the others are markers for specific operation of the TLB and of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic mmu_stress_test. The basic gist of the "conversion" is to have the test do mprotect() on guest memory while vCPUs are accessing said memory, e.g. to verify KVM and mmu_notifiers are working as intended

   - Play nice with treewide builds of unsupported architectures, e.g. arm (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the target architecture is actually one KVM selftests supports

   - Use the kernel's $(ARCH) definition instead of the target triple for arch specific directories, e.g. arm64 instead of aarch64, mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel PMU counters test by adding a data load and doing CLFLUSH{OPT} on the data instead of the code being executed.
     It seems that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that events are counting correctly without actually knowing what the events count given the underlying hardware; this can happen if Intel reuses a formerly microarchitecture-specific event encoding as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-24  kvm: defer huge page recovery vhost task to later  (Keith Busch)
Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
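A sketch of the call_once() infrastructure described above; the names follow the pthread_once analogy in the changelog and may not match the final include/linux/call_once.h exactly:

    #include <linux/atomic.h>
    #include <linux/mutex.h>

    #define ONCE_NOT_STARTED 0
    #define ONCE_RUNNING     1
    #define ONCE_COMPLETED   2

    struct once {
            atomic_t state;
            struct mutex lock;
    };

    static inline void call_once(struct once *once, void (*cb)(struct once *))
    {
            /* Fast path; pairs with the release store below. */
            if (atomic_read_acquire(&once->state) == ONCE_COMPLETED)
                    return;

            mutex_lock(&once->lock);
            if (atomic_read(&once->state) == ONCE_NOT_STARTED) {
                    atomic_set(&once->state, ONCE_RUNNING);
                    cb(once);   /* unlike pthread_once, takes an argument */
                    atomic_set_release(&once->state, ONCE_COMPLETED);
            }
            mutex_unlock(&once->lock);
    }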
2025-01-20  Merge branch 'kvm-mirror-page-tables' into HEAD  (Paolo Bonzini)
As part of enabling TDX virtual machines, support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may differ a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs.

There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory:

  external EPT - Hidden within the TDX module, modified via TDX module calls.
  mirror EPT   - Bookkeeping tree used as an optimization by KVM, not used by the processor.
  direct EPT   - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM.

Modifying external EPT
----------------------

Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks.

Zapping external EPT
--------------------

While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and private. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations.
Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example:

  Memslot deletion           - Private and shared
  MMU notifier based zapping - Shared only
  Conversion to shared       - Private only
  Conversion to private      - Shared only

Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.
2025-01-20  Merge branch 'kvm-userspace-hypercall' into HEAD  (Paolo Bonzini)
Make the completion of hypercalls go through the complete_hypercall function pointer argument, no matter if the hypercall exits to userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically went to userspace, and all the others did not; the new code need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at all whether there was an exit to userspace or not.
2025-01-20  Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-10  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h  (Nuno Das Neves)
Switch to using hvhdk.h everywhere in the kernel. This header includes all the new Hyper-V headers in include/hyperv, which form a superset of the definitions found in hyperv-tlfs.h. This makes it easier to add new Hyper-V interfaces without being restricted to those in the TLFS doc (reflected in hyperv-tlfs.h). To be more consistent with the original Hyper-V code, the names of some definitions are changed slightly. Update those where needed. Update comments in mshyperv.h files to point to include/hyperv for adding new definitions. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-08  hyperv: Clean up unnecessary #includes  (Nuno Das Neves)
Remove includes of linux/hyperv.h, mshyperv.h, and hyperv-tlfs.h where they are not used. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://lore.kernel.org/r/1732577084-2122-3-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1732577084-2122-3-git-send-email-nunodasneves@linux.microsoft.com>
2024-12-23  KVM: x86/tdp_mmu: Propagate tearing down mirror page tables  (Isaku Yamahata)
Integrate hooks for mirroring page table operations for cases where TDX will zap PTEs or free page tables. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. Since calls into the TDX module are relatively slow, walking private page tables by making calls into the TDX module would not be efficient. Because of this, previous changes have taught the TDP MMU to keep a mirror root, which is a separate, unmapped TDP root that private operations can be directed to. Currently this root is disconnected from the guest. Now add plumbing to propagate changes to the "external" page tables being mirrored. Just create the x86_ops for now, leave plumbing the operations into the TDX module for future patches. Add two operations for tearing down page tables, one for freeing page tables (free_external_spt) and one for zapping PTEs (remove_external_spte). Define them such that remove_external_spte will perform a TLB flush as well (in TDX terms, "ensure there are no active translations"). TDX MMU support will exclude certain MMU operations, so only plug in the mirroring x86 ops where they will be needed. For zapping/freeing, only hook tdp_mmu_iter_set_spte() which is used for mapping and linking PTs. Don't bother hooking tdp_mmu_set_spte_atomic() as it is only used for zapping PTEs in operations unsupported by TDX: zapping collapsible PTEs and kvm_mmu_zap_all_fast(). In previous changes to address races around concurrent populating using tdp_mmu_set_spte_atomic(), a solution was introduced to temporarily set FROZEN_SPTE in the mirrored page tables while performing the external operations. Such a solution is not needed for the tear down paths in TDX as these will always be performed with the mmu_lock held for write. Sprinkle some KVM_BUG_ON()s to reflect this. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-16-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23  KVM: x86/tdp_mmu: Propagate building mirror page tables  (Isaku Yamahata)
Integrate hooks for mirroring page table operations for cases where TDX will set PTEs or link page tables. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. Since calls into the TDX module are relatively slow, walking private page tables by making calls into the TDX module would not be efficient. Because of this, previous changes have taught the TDP MMU to keep a mirror root, which is a separate, unmapped TDP root that private operations can be directed to. Currently this root is disconnected from any actual guest mapping. Now add plumbing to propagate changes to the "external" page tables being mirrored. Just create the x86_ops for now, leave plumbing the operations into the TDX module for future patches. Add two operations for setting up external page tables, one for linking new page tables and one for setting leaf PTEs. Don't add any op for configuring the root PFN, as TDX handles this itself. Don't provide a way to set permissions on the PTEs also, as TDX doesn't support it. This results in MMU "mirroring" support that is very targeted towards TDX. Since it is likely there will be no other user, the main benefit of making the support generic is to keep TDX specific *looking* code outside of the MMU. As a generic feature it will still make enough sense from TDX's perspective, and for developers unfamiliar with the TDX arch it can express the general concepts such that they can continue to work in the code. TDX MMU support will exclude certain MMU operations, so only plug in the mirroring x86 ops where they will be needed. For setting/linking, only hook tdp_mmu_set_spte_atomic() which is used for mapping and linking PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for setting PTEs in operations unsupported by TDX: splitting huge pages and write protecting. Sprinkle KVM_BUG_ON()s to document as code that these paths are not supported for mirrored page tables. For zapping operations, leave those for near future changes. Many operations in the TDP MMU depend on atomicity of the PTE update. While the mirror PTE on KVM's side can be updated atomically, the update that happens inside the external operations (S-EPT updates via TDX module call) can't happen atomically with the mirror update. The following race could result during two vCPUs populating private memory:

  * vcpu 1: atomically update 2M level mirror EPT entry to be present
  * vcpu 2: read 2M level EPT entry that is present
  * vcpu 2: walk down into 4K level EPT
  * vcpu 2: atomically update 4K level mirror EPT entry to be present
  * vcpu 2: set_external_spte() to update 4K secure EPT entry => error because 2M secure EPT entry is not populated yet
  * vcpu 1: link_external_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to FROZEN_SPTE while the external operations are performed. Only write the actual mirror PTE value once the external operations have completed. When trying to set a PTE to present and encountering a frozen SPTE, retry the fault.
By doing this the race is prevented as follows:

  * vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
  * vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
  * vcpu 2: find that the EPT entry is frozen; abandon the page table walk to resume guest execution
  * vcpu 1: link_external_spt() to update 2M secure EPT entry
  * vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
  * vcpu 2: resume guest execution

Depending on vcpu 1's state, vcpu 2 may hit the EPT violation again or make progress on guest execution. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
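A sketch of the freeze/unfreeze sequence around the non-atomic S-EPT update (FROZEN_SPTE is from the changelog; the hook name and argument list are simplified assumptions):

    static int tdp_mmu_set_mirror_spte_atomic(struct kvm *kvm,
                                              struct tdp_iter *iter,
                                              u64 new_spte)
    {
            int ret;

            /* Freeze the mirror PTE: concurrent faulting vCPUs observe
             * FROZEN_SPTE and retry instead of racing the TDX module. */
            if (!try_cmpxchg64(rcu_dereference(iter->sptep),
                               &iter->old_spte, FROZEN_SPTE))
                    return -EBUSY;

            /* The slow, non-atomic TDX module call (set/link S-EPT). */
            ret = kvm_x86_call(set_external_spte)(kvm, iter->gfn,
                                                  iter->level, new_spte);
            if (ret)
                    return ret;

            /* Publish the real value only once the S-EPT is consistent. */
            __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
            return 0;
    }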
2024-12-23  KVM: x86/tdp_mmu: Support mirror root for TDP MMU  (Isaku Yamahata)
Add the ability for the TDP MMU to maintain a mirror of a separate mapping. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. In order to handle both shared and private memory, KVM needs to learn to handle faults and other operations on the correct root for the operation. KVM could learn the concept of private roots, and operate on them by calling out to operations that call into the TDX module. But there are two problems with that:

1. Calls into the TDX module are relatively slow compared to the simple accesses required to read a PTE managed directly by KVM.
2. Other Coco technologies deal with private memory completely differently and it will make the code confusing when being read from their perspective. Special operations added for TDX that set private or zap private memory will have nothing to do with these other private memory technologies (SEV, etc).

To handle these, instead teach the TDP MMU about a new concept, "mirror roots". Such roots maintain page tables that are not actually mapped, and are just used to traverse quickly to determine if the mid-level page tables need to be installed. When the memory being mirrored needs to actually be changed, calls can be made via x86_ops.

  private KVM page fault   |
      |                    |
      V                    |
 private GPA               |     CPU protected EPTP
      |                    |           |
      V                    |           V
 mirror PT root            |     external PT root
      |                    |           |
      V                    |           V
   mirror PT --hook to propagate-->external PT
      |                    |           |
      \--------------------+------\    |
                           |      |    |
                           |      V    V
                           |    private guest page
                           |
     non-encrypted memory  |    encrypted memory
                           |

Leave calling out to actually update the private page tables that are being mirrored for later changes. Just implement the handling of MMU operations on to mirrored roots. In order to direct operations to the correct root, add root types KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct roots to private/shared with conditionals. It could also be implemented by making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line up such that conversion could be a direct assignment with a case. Don't do this because the mapping of private to mirrored is confusing enough. So it is worth not hiding the logic in type casting. Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place in kvm_mmu_free_roots(), because the mirror root cannot be rebuilt like a normal root. It needs to persist for the lifetime of the VM. The TDX module will also need to be provided with page tables to use for the actual mapping being mirrored by the mirrored page tables. Allocate these in the mapping path using the recently added kvm_mmu_alloc_external_spt(). Don't support 2M pages for now. This is avoided by forcing 4K pages in the fault. Add a KVM_BUG_ON() to verify. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23  KVM: x86/mmu: Support GFN direct bits  (Isaku Yamahata)
Teach the MMU to map guest GFNs at a massaged position on the TDP, to aid in implementing TDX shared memory. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot lookups and zapping operations will be provided with a GFN without the shared bit set. So KVM will either need to apply or strip the shared bit before mapping or zapping the shared EPT. Having GFNs sometimes have the shared bit and sometimes not would make the code confusing. So instead arrange the code such that GFNs never have the shared bit set. Create a concept of "direct bits", that is stripped from the fault address when setting fault->gfn, and applied within the TDP MMU iterator. Calling code will behave as if it is operating on the PTE mapping the GFN (without shared bits) but within the iterator, the actual mappings will be shifted using bits specific for the root. SPs will have the GFN set without the shared bit. In the end the TDP MMU will behave like it is mapping things at the GFN without the shared bit but with a strange page table format where everything is offset by the shared bit. Since TDX only needs to shift the mapping like this for the shared bit, which is mapped as the normal TDP root, add a "gfn_direct_bits" field to the kvm_arch structure for each VM with a default value of 0. It will have the bit set at the position of the GPA shared bit in GFN through TD specific initialization code. Keep TDX specific concepts out of the MMU code by not naming it "shared". Ranged TLB flushes (i.e. flush_remote_tlbs_range()) target specific GFN ranges. In the convention established above, these would need to target the shifted GFN range. It won't matter functionally, since the actual implementation will always result in a full flush for the only planned user (TDX). For correctness reasons, future changes can provide a TDX x86_ops.flush_remote_tlbs_range implementation to return -EOPNOTSUPP and force the full flush for TDs. This leaves one problem. Some operations use a concept of max GFN (i.e. kvm_mmu_max_gfn()), to iterate over the whole TDP range. When applying the direct mask to the start of the range, the iterator would end up skipping iterating over the range not covered by the direct mask bit. For safety, make sure the __tdp_mmu_zap_root() operation iterates over the full GFN range supported by the underlying TDP format. Add a new iterator helper, for_each_tdp_pte_min_level_all(), that iterates the entire TDP GFN range, regardless of root. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-9-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
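A sketch of how the iterator-side shifting could look, per the convention that only the direct root carries the shared alias (names assumed):

    static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm,
                                          const struct kvm_mmu_page *root)
    {
            /* The mirror (private) root maps the unshifted GFN; only the
             * direct root maps the high "shared" alias for TDX. For
             * normal VMs gfn_direct_bits is 0, so this is a nop. */
            if (is_mirror_sp(root))
                    return 0;

            return kvm->arch.gfn_direct_bits;
    }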
2024-12-23KVM: x86/mmu: Add an is_mirror member for union kvm_mmu_page_roleIsaku Yamahata
Introduce a "is_mirror" member to the kvm_mmu_page_role union to identify SPTEs associated with the mirrored EPT. The TDX module maintains the private half of the EPT mapped in the TD in its protected memory. KVM keeps a copy of the private GPAs in a mirrored EPT tree within host memory. This "is_mirror" attribute enables vCPUs to find and get the root page of mirrored EPT from the MMU root list for a guest TD. This also allows KVM MMU code to detect changes in mirrored EPT according to the "is_mirror" mmu page role and propagate the changes to the private EPT managed by TDX module. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-6-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-23KVM: x86/mmu: Add an external pointer to struct kvm_mmu_pageIsaku Yamahata
Add an external pointer to struct kvm_mmu_page for TDX's private page table and add helper functions to allocate/initialize/free a private page table page. TDX will only be supported with the TDP MMU. Because the KVM TDP MMU doesn't use unsync_children and write_flooding_count, pack them to make room for a pointer, and use a union to avoid memory overhead.

For a private GPA, the CPU refers to a private page table whose contents are encrypted. Dedicated APIs must be used to operate on it (e.g. updating/reading its PTEs), and they are expensive. When KVM resolves a KVM page fault, it walks the page tables. To reuse the existing KVM MMU code and mitigate the heavy cost of directly walking the private page table, allocate two sets of page tables for the private half of the GPA space. For the page tables that KVM will walk, allocate them like normal and refer to them as mirror page tables. Additionally allocate one more page for the page tables the CPU will walk, and call them external page tables. Resolve the KVM page fault with the existing code, and do additional operations necessary for modifying the external page table in future patches.

The relationship of the types of page tables in this scheme is depicted below:

              KVM page fault              |
                     |                    |
                     V                    |
        -------------+----------          |
        |                      |          |
        V                      V          |
    shared GPA            private GPA     |
        |                      |          |
        V                      V          |
   shared PT root        mirror PT root   |    private PT root
        |                      |          |           |
        V                      V          |           V
    shared PT             mirror PT --propagate--> external PT
        |                      |          |           |
        |                      \----------+------\    |
        |                                 |      |    |
        V                                 |      V    V
  shared guest page                       |  private guest page
                                          |
    non-encrypted memory                  |    encrypted memory
                                          |

  PT          - Page table
  Shared PT   - Visible to KVM, and the CPU uses it for shared mappings.
  External PT - The CPU uses it, but it is invisible to KVM. The TDX
                module updates this table to map private guest pages.
  Mirror PT   - It is visible to KVM, but the CPU doesn't use it. KVM
                uses it to propagate PT changes to the actual private PT.

Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it to the TDX VM type.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
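Condensed sketch of the packing and the helper described above; member layout is abbreviated, and the KVM_X86_TDX_VM check follows the "TDX VM type" wiring mentioned in the changelog:

  struct kvm_mmu_page {
          union {
                  /* Unused by the TDP MMU, so they can share space... */
                  struct {
                          unsigned int unsync_children;
                          atomic_t write_flooding_count;
                  };
                  /* ...with the page table page the CPU walks for private GPAs. */
                  void *external_spt;
          };
          /* other members elided */
  };

  static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
  {
          return kvm->arch.vm_type == KVM_X86_TDX_VM;
  }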
2024-12-22KVM: x86: Move "emulate hypercall" function declarations to x86.hSean Christopherson
Move the declarations for the hypercall emulation APIs to x86.h. While the helpers are exported, they are intended to be consumed only by KVM vendor modules, i.e. don't need to be exposed to the kernel at-large. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20241128004344.4072099-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-19KVM: x86: Remove hwapic_irr_update() from kvm_x86_opsChao Gao
Remove the redundant .hwapic_irr_update() ops. If a vCPU has APICv enabled, KVM updates its RVI before VM-enter to L1 in vmx_sync_pir_to_irr(). This guarantees RVI is up-to-date and aligned with the vIRR in the virtual APIC. So, there is no need to update RVI every time the vIRR changes. Note that KVM never updates vmcs02 RVI in .hwapic_irr_update() or vmx_sync_pir_to_irr(). So, removing .hwapic_irr_update() has no impact on the nested case. Signed-off-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241111085947.432645-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18KVM: x86: Add interrupt injection information to the kvm_entry tracepointMaxim Levitsky
Add VMX/SVM specific interrupt injection info to the kvm_entry tracepoint. As is done with kvm_exit, gather the information via a kvm_x86_ops hook to avoid the moderately costly VMREADs on VMX when the tracepoint isn't enabled. Opportunistically rename the parameters in the get_exit_info() declaration to match the names used by both SVM and VMX. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com [sean: drop is_guest_mode() change, use intr_info/error_code for names] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18KVM: VMX: Handle event vectoring error in check_emulate_instruction()Ivan Orlov
Move handling of emulation during event vectoring, which KVM doesn't support, into VMX's check_emulate_instruction(), so that KVM detects all unsupported emulation, not just cached emulated MMIO (EPT misconfig), e.g. emulated MMIO that isn't cached (EPT Violation) or that occurs with legacy shadow paging (#PF). Rejecting emulation on other sources of emulation also fixes a largely theoretical flaw (thanks to the "unprotect and retry" logic), where KVM could incorrectly inject a #DF:

 1. CPU executes an instruction and hits a #GP
 2. While vectoring the #GP, a shadow #PF occurs
 3. On the #PF VM-Exit, KVM re-injects #GP
 4. KVM emulates because of the write-protected page
 5. KVM "successfully" emulates and also detects the #GP
 6. KVM synthesizes a #GP, and since #GP has already been injected,
    incorrectly escalates to a #DF

Fix the comment about EMULTYPE_PF, as this flag doesn't necessarily mean MMIO anymore: it can also be set due to a write protection violation. Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before attempting emulation of any kind. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ivan Orlov <iorlov@amazon.com> Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18KVM: x86: Add function for vectoring error generationIvan Orlov
Extract the VMX code for unhandleable VM-Exit during vectoring into a vendor-agnostic function so that the boilerplate code can be shared by SVM. To avoid unnecessary complexity in the helper, unconditionally report a GPA to userspace instead of having a conditional entry. For exits that don't report a GPA, i.e. everything except EPT Misconfig, simply report KVM's "invalid GPA". Signed-off-by: Ivan Orlov <iorlov@amazon.com> Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com [sean: clarify that the INVALID_GPA logic is new] Signed-off-by: Sean Christopherson <seanjc@google.com>
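A sketch of what the vendor-agnostic helper could look like; the name and the exact run-struct payload are assumptions based on the changelog, not a verbatim copy:

  void kvm_prepare_event_vectoring_exit(struct kvm_vcpu *vcpu, gpa_t gpa)
  {
          struct kvm_run *run = vcpu->run;

          run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
          run->internal.suberror = KVM_INTERNAL_ERROR_DELIVERY_EV;
          /* Exits without a GPA (everything but EPT Misconfig) pass INVALID_GPA. */
          run->internal.data[0] = gpa;
          run->internal.ndata = 1;
  }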
2024-12-18KVM: x86: Replace guts of "governed" features with comprehensive cpu_capsSean Christopherson
Replace the internals of the governed features framework with a more comprehensive "guest CPU capabilities" implementation, i.e. with a guest version of kvm_cpu_caps. Keep the skeleton of governed features around for now as vmx_adjust_sec_exec_control() relies on detecting governed features to do the right thing for XSAVES, and switching all guest feature queries to guest_cpu_cap_has() requires subtle and non-trivial changes, i.e. is best done as a standalone change. Tracking *all* guest capabilities that KVM cares about will allow excising the poorly named "governed features" framework, and effectively optimizes all KVM queries of guest capabilities, i.e. doesn't require making a subjective decision as to whether or not a feature is worth "governing", and doesn't require adding the code to do so. The cost of tracking all features is currently 92 bytes per vCPU on 64-bit kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features. That cost is well worth paying even if the only benefit was eliminating the "governed features" terminology. And practically speaking, the real cost is zero unless those 92 bytes pushes the size of vcpu_vmx or vcpu_svm into a new order-N allocation, and if that happens there are better ways to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR state separate allocations. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
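Roughly, guest capability queries then become simple bitmap tests, as in this sketch (assuming cpu_caps mirrors the kvm_cpu_caps layout):

  /* In struct kvm_vcpu_arch: the guest's effective capabilities. */
  u32 cpu_caps[NR_KVM_CPU_CAPS];

  static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
                                                unsigned int x86_feature)
  {
          unsigned int x86_leaf = __feature_leaf(x86_feature);

          return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
  }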
2024-12-18KVM: x86: Remove unnecessary caching of KVM's PV CPUID baseSean Christopherson
Now that KVM only searches for KVM's PV CPUID base when userspace sets guest CPUID, drop the cache and simply do the search every time. Practically speaking, this is a nop except for situations where userspace sets CPUID _after_ running the vCPU, which is anything but a hot path, e.g. QEMU does so only when hotplugging a vCPU. And on the flip side, caching guest CPUID information, especially information that is used to query/modify _other_ CPUID state, is inherently dangerous as it's all too easy to use stale information, i.e. KVM should only cache CPUID state when the performance and/or programming benefits justify it. Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
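The on-demand search is conceptually just a scan of the hypervisor CPUID range for KVM's signature; a sketch, with the helper name assumed:

  static u32 kvm_find_pv_cpuid_base(struct kvm_vcpu *vcpu)
  {
          u32 base;

          for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                  struct kvm_cpuid_entry2 *entry;
                  u32 signature[3];

                  entry = kvm_find_cpuid_entry_index(vcpu, base, 0);
                  if (!entry)
                          continue;

                  signature[0] = entry->ebx;
                  signature[1] = entry->ecx;
                  signature[2] = entry->edx;
                  if (!memcmp(signature, KVM_SIGNATURE, sizeof(signature)))
                          return base;
          }
          return 0;
  }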
2024-12-16KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()Sean Christopherson
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX can defer the update until after nested VM-Exit if an EOI for L1's vAPIC occurs while L2 is active. Note, commit d39850f57d21 ("KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update()") removed the parameter with the justification that doing so "allows for a decent amount of (future) cleanup in the APIC code", but it's not at all clear what cleanup was intended, or if it was ever realized. No functional change intended. Cc: stable@vger.kernel.org Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
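The resulting hook shape, as a sketch (parameter naming assumed):

  /* In struct kvm_x86_ops: */
  void (*hwapic_isr_update)(struct kvm_vcpu *vcpu, int max_isr);

  /* Callers now pass the target vCPU explicitly, e.g.: */
  kvm_x86_call(hwapic_isr_update)(vcpu, max_isr);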
2024-11-14KVM: x86: switch hugepage recovery thread to vhost_taskPaolo Bonzini
kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container.

However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced:

  root@test ~# mkdir /sys/fs/cgroup/test
  root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
  root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

  root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  root@test ~# cat /sys/fs/cgroup/test/cgroup.events
  populated 1
  frozen 0

The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely.

Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being stoppable and resumable with SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only does it inherit the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative.

Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun).

Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-13Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.13

 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.

 - Quirk KVM's misguided behavior of initializing certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures.

 - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57.

 - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occurring in the future.

 - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs.

 - Minor cleanups
2024-11-04KVM: x86/mmu: Drop per-VM zapped_obsolete_pages listVipin Sharma
Drop the per-VM zapped_obsolete_pages list now that the usage from the defunct mmu_shrinker is gone, and instead use a local list to track pages in kvm_zap_obsolete_pages(), the sole remaining user of zapped_obsolete_pages. Opportunistically add an assertion to verify and document that slots_lock must be held, i.e. that there can only be one active instance of kvm_zap_obsolete_pages() at any given time, and by doing so also prove that using a local list instead of a per-VM list doesn't change any functionality (beyond trivialities like list initialization). Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
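Sketch of the reworked shape; the zap details are elided, only the local list and the assertion follow the changelog:

  static void kvm_zap_obsolete_pages(struct kvm *kvm)
  {
          LIST_HEAD(invalid_list);  /* local; replaces kvm->arch.zapped_obsolete_pages */

          /*
           * Holding slots_lock means at most one instance can be active,
           * which is what makes the local list safe.
           */
          lockdep_assert_held(&kvm->slots_lock);

          /* ...walk the obsolete SPs, batching zaps into invalid_list... */
  }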
2024-11-04KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zappingDavid Matlack
Recover TDP MMU huge page mappings in-place instead of zapping them when dirty logging is disabled, and rename functions that recover huge page mappings when dirty logging is disabled to move away from the "zap collapsible spte" terminology. Before KVM flushes TLBs, guest accesses may be translated through either the (stale) small SPTE or the (new) huge SPTE. This is already possible when KVM is doing eager page splitting (where TLB flushes are also batched), and when vCPUs are faulting in huge mappings (where TLBs are flushed after the new huge SPTE is installed). Recovering huge pages reduces the number of page faults when dirty logging is disabled:

  $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g

  Before: 393,599 kvm:kvm_page_fault
  After:  262,575 kvm:kvm_page_fault

vCPU throughput and the latency of disabling dirty-logging are about equal compared to zapping, but avoiding faults can be beneficial to remove vCPU jitter in extreme scenarios. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01KVM: x86: Quirk initialization of feature MSRs to KVM's max configurationSean Christopherson
Add a quirk to control KVM's misguided initialization of select feature MSRs to KVM's max configuration, as enabling features by default violates KVM's approach of letting userspace own the vCPU model, and is actively problematic for MSRs that are conditionally supported, as the vCPU will end up with an MSR value that userspace can't restore. E.g. if the vCPU is configured with PDCM=0, userspace will save and attempt to restore a non-zero PERF_CAPABILITIES, thanks to KVM's meddling. Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
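Gating the legacy behavior then looks roughly like this; the quirk and function names below are placeholders, as the changelog doesn't spell them out:

  static void kvm_init_feature_msrs_to_max(struct kvm_vcpu *vcpu)
  {
          /* Quirk name is an assumption for illustration. */
          if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS))
                  return; /* let userspace fully own the vCPU model */

          vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
  }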
2024-11-01KVM: x86: Bypass register cache when querying CPL from kvm_sched_out()Sean Christopherson
When querying guest CPL to determine if a vCPU was preempted while in kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from the VMCS on Intel CPUs. If the kernel is running with full preemption enabled, using the register cache in the preemption path can result in stale and/or uninitialized data being cached in the segment cache. In particular the following scenario is currently possible:

 - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset().

 - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.

 - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache.

As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g. KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type.

Note, the VM-Enter failure can also be avoided by moving the call to vmx_segment_cache_clear() until after vmx_vcpu_reset() initializes all segments. However, while that change is correct and desirable (and will come along shortly), it does not address the underlying problem that accessing KVM's register caches from !task context is generally unsafe. In addition to fixing the immediate bug, bypassing the cache for this particular case will allow hardening KVM's register caching logic to assert that the caches are accessed only when KVM _knows_ it is safe to do so.

Fixes: de63ad4cf497 ("KVM: X86: implement the logic for spinlock optimization") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
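For the preemption path, the uncached read boils down to something like the following sketch (helper name assumed, real-mode handling glossed over):

  static int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu)
  {
          /* Read SS.AR_BYTES straight from the VMCS, skipping the segment cache. */
          u32 ar = vmcs_read32(GUEST_SS_AR_BYTES);

          return VMX_AR_DPL(ar);
  }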
2024-09-17Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.12:

 - Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical.

 - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2.

 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.

 - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible).

 - Minor SGX fix and a cleanup.
2024-09-17Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.12:

 - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop.

 - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU.

 - Refactor pieces of the shadow MMU related to aging SPTEs in preparation for adding MGLRU support in KVM.

 - Misc cleanups
2024-09-17Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.12

 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon).

 - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs. This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work).

 - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset.

 - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler.

 - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit.

 - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is supposed to hit L1, not L2).