summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2025-03-14KVM: TDX: create/destroy VM structureIsaku Yamahata
Implement managing the TDX private KeyID to implement, create, destroy and free for a TDX guest. When creating at TDX guest, assign a TDX private KeyID for the TDX guest for memory encryption, and allocate pages for the guest. These are used for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS). On destruction, free the allocated pages, and the KeyID. Before tearing down the private page tables, TDX requires the guest TD to be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops hook. The TDR control structures can be freed in the vm_destroy() hook, which runs last. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Fix build issue in kvm-coco-queue - Init ret earlier to fix __tdx_td_init() error handling. (Chao) - Standardize -EAGAIN for __tdx_td_init() retry errors (Rick) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Get system-wide info about TDX module on initializationIsaku Yamahata
TDX KVM needs system-wide information about the TDX module. Generate the data based on tdx_sysinfo td_conf CPUID data. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> --- - Clarify comment about EAX[23:16] in td_init_cpuid_entry2() (Xiaoyao) - Add comment for configurable CPUID bits (Xiaoyao) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctlIsaku Yamahata
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for guest state-protected VM. It defined subcommands for technology-specific operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands are not limited to memory encryption, but various technology-specific operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX specific operations and define subcommands. Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op. TDX specific sub-commands will be added to retrieve/pass TDX specific parameters. Make mem_enc_ioctl non-optional as it's always filled. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> --- - Drop the misleading "defined for consistency" line. It's a copy-paste error introduced in the earlier patches. Earlier there was padding at the end to match struct kvm_sev_cmd size. (Tony) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add helper functions to print TDX SEAMCALL errorIsaku Yamahata
Add helper functions to print out errors from the TDX module in a uniform manner. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add TDX "architectural" error codesSean Christopherson
Add error codes for the TDX SEAMCALLs both for TDX VMM side for TDH SEAMCALL and TDX guest side for TDG.VP.VMCALL. KVM issues the TDX SEAMCALLs and checks its error code. KVM handles hypercall from the TDX guest and may return an error. So error code for the TDX guest is also needed. TDX SEAMCALL uses bits 31:0 to return more information, so these error codes will only exactly match RAX[63:32]. Error codes for TDG.VP.VMCALL is defined by TDX Guest-Host-Communication interface spec. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20241030190039.77971-14-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Define TDX architectural definitionsIsaku Yamahata
Define architectural definitions for KVM to issue the TDX SEAMCALLs. Structures and values that are architecturally defined in the TDX module specifications the chapter of ABI Reference. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> --- - Drop old duplicate defines, the x86 core exports what's needed (Kai) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add placeholders for TDX VM/vCPU structuresIsaku Yamahata
Add TDX's own VM and vCPU structures as placeholder to manage and run TDX guests. Also add helper functions to check whether a VM/vCPU is TDX or normal VMX one, and add helpers to convert between TDX VM/vCPU and KVM VM/vCPU. TDX protects guest VMs from malicious host. Unlike VMX guests, TDX guests are crypto-protected. KVM cannot access TDX guests' memory and vCPU states directly. Instead, TDX requires KVM to use a set of TDX architecture-defined firmware APIs (a.k.a TDX module SEAMCALLs) to manage and run TDX guests. In fact, the way to manage and run TDX guests and normal VMX guests are quite different. Because of that, the current structures ('struct kvm_vmx' and 'struct vcpu_vmx') to manage VMX guests are not quite suitable for TDX guests. E.g., the majority of the members of 'struct vcpu_vmx' don't apply to TDX guests. Introduce TDX's own VM and vCPU structures ('struct kvm_tdx' and 'struct vcpu_tdx' respectively) for KVM to manage and run TDX guests. And instead of building TDX's VM and vCPU structures based on VMX's, build them directly based on 'struct kvm'. As a result, TDX and VMX guests will have different VM size and vCPU size/alignment. Currently, kvm_arch_alloc_vm() uses 'kvm_x86_ops::vm_size' to allocate enough space for the VM structure when creating guest. With TDX guests, ideally, KVM should allocate the VM structure based on the VM type so that the precise size can be allocated for VMX and TDX guests. But this requires more extensive code change. For now, simply choose the maximum size of 'struct kvm_tdx' and 'struct kvm_vmx' for VM structure allocation for both VMX and TDX guests. This would result in small memory waste for each VM which has smaller VM structure size but this is acceptable. For simplicity, use the same way for vCPU allocation too. Otherwise KVM would need to maintain a separate 'kvm_vcpu_cache' for each VM type. Note, updating the 'vt_x86_ops::vm_size' needs to be done before calling kvm_ops_update(), which copies vt_x86_ops to kvm_x86_ops. However this happens before TDX module initialization. Therefore theoretically it is possible that 'kvm_x86_ops::vm_size' is set to size of 'struct kvm_tdx' (when it's larger) but TDX actually fails to initialize at a later time. Again the worst case of this is wasting couple of bytes memory for each VM. KVM could choose to update 'kvm_x86_ops::vm_size' at a later time depending on TDX's status but that would require base KVM module to export either kvm_x86_ops or kvm_ops_update(). Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- - Make to_kvm_tdx() and to_tdx() private to tdx.c (Francesco, Tony) - Add pragma poison for to_vmx() (Paolo) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Get TDX global informationKai Huang
KVM will need to consult some essential TDX global information to create and run TDX guests. Get the global information after initializing TDX. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20241030190039.77971-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: VMX: Initialize TDX during KVM module loadKai Huang
Before KVM can use TDX to create and run TDX guests, TDX needs to be initialized from two perspectives: 1) TDX module must be initialized properly to a working state; 2) A per-cpu TDX initialization, a.k.a the TDH.SYS.LP.INIT SEAMCALL must be done on any logical cpu before it can run any other TDX SEAMCALLs. The TDX host core-kernel provides two functions to do the above two respectively: tdx_enable() and tdx_cpu_enable(). There are two options in terms of when to initialize TDX: initialize TDX at KVM module loading time, or when creating the first TDX guest. Choose to initialize TDX during KVM module loading time: Initializing TDX module is both memory and CPU time consuming: 1) the kernel needs to allocate a non-trivial size(~1/256) of system memory as metadata used by TDX module to track each TDX-usable memory page's status; 2) the TDX module needs to initialize this metadata, one entry for each TDX-usable memory page. Also, the kernel uses alloc_contig_pages() to allocate those metadata chunks, because they are large and need to be physically contiguous. alloc_contig_pages() can fail. If initializing TDX when creating the first TDX guest, then there's chance that KVM won't be able to run any TDX guests albeit KVM _declares_ to be able to support TDX. This isn't good for the user. On the other hand, initializing TDX at KVM module loading time can make sure KVM is providing a consistent view of whether KVM can support TDX to the user. Always only try to initialize TDX after VMX has been initialized. TDX is based on VMX, and if VMX fails to initialize then TDX is likely to be broken anyway. Also, in practice, supporting TDX will require part of VMX and common x86 infrastructure in working order, so TDX cannot be enabled alone w/o VMX support. There are two cases that can result in failure to initialize TDX: 1) TDX cannot be supported (e.g., because of TDX is not supported or enabled by hardware, or module is not loaded, or missing some dependency in KVM's configuration); 2) Any unexpected error during TDX bring-up. For the first case only mark TDX is disabled but still allow KVM module to be loaded. For the second case just fail to load the KVM module so that the user can be aware. Because TDX costs additional memory, don't enable TDX by default. Add a new module parameter 'enable_tdx' to allow the user to opt-in. Note, the name tdx_init() has already been taken by the early boot code. Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM doesn't actually teardown TDX). They don't match vt_init()/vt_exit(), vmx_init()/vmx_exit() etc but it's not end of the world. Also, once initialized, the TDX module cannot be disabled and enabled again w/o the TDX module runtime update, which isn't supported by the kernel. After TDX is enabled, nothing needs to be done when KVM disables hardware virtualization, e.g., when offlining CPU, or during suspend/resume. TDX host core-kernel code internally tracks TDX status and can handle "multiple enabling" scenario. Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX code. Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST because in the longer term there's a use case that requires making SEAMCALLs w/o KVM as mentioned by Dan [1]. Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: VMX: Refactor VMX module init/exit functionsKai Huang
Add vt_init() and vt_exit() as the new module init/exit functions and refactor existing vmx_init()/vmx_exit() as helper to make room for TDX specific initialization and teardown. To support TDX, KVM will need to enable TDX during KVM module loading time. Enabling TDX requires enabling hardware virtualization first so that all online CPUs (and the new CPU going online) are in post-VMXON state. Currently, the vmx_init() flow is: 1) hv_init_evmcs(), 2) kvm_x86_vendor_init(), 3) Other VMX specific initialization, 4) kvm_init() The kvm_x86_vendor_init() invokes kvm_x86_init_ops::hardware_setup() to do VMX specific hardware setup and calls kvm_update_ops() to initialize kvm_x86_ops to VMX's version. TDX will have its own version for most of kvm_x86_ops callbacks. It would be nice if kvm_x86_init_ops::hardware_setup() could also be used for TDX, but in practice it cannot. The reason is, as mentioned above, TDX initialization requires hardware virtualization having been enabled, which must happen after kvm_update_ops(), but hardware_setup() is done before that. Also, TDX is based on VMX, and it makes sense to only initialize TDX after VMX has been initialized. If VMX fails to initialize, TDX is likely broken anyway. So the new flow of KVM module init function will be: 1) Current VMX initialization code in vmx_init() before kvm_init(), 2) TDX initialization, 3) kvm_init() Split vmx_init() into two parts based on above 1) and 3) so that TDX initialization can fit in between. Make part 1) as the new helper vmx_init(). Introduce vt_init() as the new module init function which calls vmx_init() and kvm_init(). TDX initialization will be added later. Do the same thing for vmx_exit()/vt_exit(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <3f23f24098bdcf42e213798893ffff7cdc7103be.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protectedPaolo Bonzini
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state, since the register values cannot actually be written. Return 0 when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero. However, on exit from KVM_RUN userspace could have placed a nonzero value into kvm_run->kvm_valid_regs, so check guest_state_protected again and skip store_regs() in that case. Cc: stable@vger.kernel.org Fixes: 517987e3fb19 ("KVM: x86: add fields to struct kvm_arch for CoCo features") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250306202923.646075-1-pbonzini@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Add infrastructure for secure TSCIsaku Yamahata
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit changing TSC offset/multiplier when guest_tsc_protected is true. X86 confidential computing technology defines protected guest TSC so that the VMM can't change the TSC offset/multiplier once vCPU is initialized. SEV-SNP defines Secure TSC as optional, whereas TDX mandates it. KVM has common logic on x86 that tries to guess or adjust TSC offset/multiplier for better guest TSC and TSC interrupt latency at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes (kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation conflicts with protected TSC because the VMM can't change the TSC offset/multiplier. Because KVM emulates the TSC timer or the TSC deadline timer with the TSC offset/multiplier, the TSC timer interrupts is injected to the guest at the wrong time if the KVM TSC offset is different from what the TDX module determined. Originally this issue was found by cyclic test of rt-test [1] as the latency in TDX case is worse than VMX value + TDX SEAMCALL overhead. It turned out that the KVM TSC offset is different from what the TDX module determines. Disable or ignore the KVM logic to change/adjust the TSC offset/multiplier somehow, thus keeping the KVM TSC offset/multiplier the same as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a change in the TSC offset. [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Push down setting vcpu.arch.user_set_tscIsaku Yamahata
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc() to __kvm_synchronize_tsc() so that the two callers don't have to modify user_set_tsc directly as preparation. Later, prohibit changing TSC synchronization for TDX guests to modify __kvm_synchornize_tsc() change. We don't want to touch caller sites not to change user_set_tsc. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: move vm_destroy callback at end of kvm_arch_destroy_vmPaolo Bonzini
TDX needs to free the TDR control structures last, after all paging structures have been torn down; move the vm_destroy callback at a suitable place. The new place is also okay for AMD; the main difference is that the MMU has been torn down and, if anything, that is better done before the SNP ASID is released. Extracted from a patch by Yan Zhao. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-09Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux ↵Paolo Bonzini
into HEAD KVM x86 fixes for 6.14-rcN #2 - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow. - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with the host's value, which can lead to unexpected bus lock #DBs. - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly emulate BTF. KVM's lack of context switching has meant BTF has always been broken to some extent. - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest can enable DebugSwap without KVM's knowledge. - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO memory" phase without actually generating a write-protection fault. - Fix a printf() goof in the SEV smoke test that causes build failures with -Werror. - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2 isn't supported by KVM.
2025-03-04KVM: x86: Remove the unreachable case for 0x80000022 leaf in __do_cpuid_func()Xiaoyao Li
Remove dead/unreachable (and misguided) code in KVM's processing of 0x80000022. The case statement breaks early if PERFMON_V2 isnt supported, i.e. kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2) must be true when KVM reaches the code code to setup EBX. Note, early versions of the patch that became commit 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") didn't break early on lack of PERFMON_V2 support, and instead enumerated the effective number of counters KVM could emulate. All of that code was flawed, e.g. the APM explicitly states EBX is valid only for v2. Performance Monitoring Version 2 supported. When set, CPUID_Fn8000_0022_EBX reports the number of available performance counters. When the flaw of not respecting v2 support was addressed, the misguided stuffing of the number of counters got left behind. Link: https://lore.kernel.org/all/20220919093453.71737-4-likexu@tencent.com Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-2-xiaoyao.li@intel.com [sean: elaborate on the situation a bit more, add Fixes] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-04KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVMXiaoyao Li
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX instead of zeroing both when PERFMON_V2 isn't supported by KVM. In practice, barring a buggy CPU (or vCPU model when running nested) only the !enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2 is unsupported. For the !enable_pmu case, the bug is relatively benign as KVM will refuse to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID into the guest could inadvertently induce #GPs in the guest due to advertising support for MSRs that KVM refuses to emulate. Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com [sean: massage shortlog and changelog, tag for stable] Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-04KVM: VMX: Use named operands in inline asmJosh Poimboeuf
Convert the non-asm-goto version of the inline asm in __vmcs_readl() to use named operands, similar to its asm-goto version. Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Cc: linux-kernel@vger.kernel.org
2025-03-03KVM: VMX: Extract checks on entry/exit control pairs to a helper macroSean Christopherson
Extract the checking of entry/exit pairs to a helper macro so that the code can be reused to process the upcoming "secondary" exit controls (the primary exit controls field is out of bits). Use a macro instead of a function to support different sized variables (all secondary exit controls will be optional and so the MSR doesn't have the fixed-0/fixed-1 split). Taking the largest size as input is trivial, but handling the modification of KVM's to-be-used controls is much trickier, e.g. would require bitmap games to clear bits from a 32-bit bitmap vs. a 64-bit bitmap. Opportunistically add sanity checks to ensure the size of the controls match (yay, macro!), e.g. to detect bugs where KVM passes in the pairs for primary exit controls, but its variable for the secondary exit controls. To help users triage mismatches, print the control bits that are checked, not just the actual value. For the foreseeable future, that provides enough information for a user to determine which fields mismatched. E.g. until secondary entry controls comes along, all entry bits and thus all error messages are guaranteed to be unique. To avoid returning from a macro, which can get quite dangerous, simply process all pairs even if error_on_inconsistent_vmcs_config is set. The speed at which KVM rejects module load is not at all interesting. Keep the error message a "once" printk, even though it would be nice to print out all mismatching pairs. In practice, the most likely scenario is that a single pair will mismatch on all CPUs. Printing all mismatches generates redundant messages in that situation, and can be extremely noisy on systems with large numbers of CPUs. If a CPU has multiple mismatches, not printing every bad pair is the least of the user's concerns. Cc: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250227005353.3216123-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Invalidate "next" SNP VMSA GPA even on failureSean Christopherson
When processing an SNP AP Creation event, invalidate the "next" VMSA GPA even if acquiring the page/pfn for the new VMSA fails. In practice, the next GPA will never be used regardless of whether or not its invalidated, as the entire flow is guarded by snp_ap_waiting_for_reset, and said guard and snp_vmsa_gpa are always written as a pair. But that's really hard to see in the code. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Use guard(mutex) to simplify SNP vCPU state updatesSean Christopherson
Use guard(mutex) in sev_snp_init_protected_guest_state() and pull in its lock-protected inner helper. Without an unlock trampoline (and even with one), there is no real need for an inner helper. Eliminating the helper also avoids having to fixup the open coded "lockdep" WARN_ON(). Opportunistically drop the error message if KVM can't obtain the pfn for the new target VMSA. The error message provides zero information that can't be gleaned from the fact that the vCPU is stuck. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Mark VMCB dirty before processing incoming snp_vmsa_gpaSean Christopherson
Mark the VMCB dirty, i.e. zero control.clean, prior to handling the new VMSA. Nothing in the VALID_PAGE() case touches control.clean, and isolating the VALID_PAGE() code will allow simplifying the overall logic. Note, the VMCB probably doesn't need to be marked dirty when the VMSA is invalid, as KVM will disallow running the vCPU in such a state. But it also doesn't hurt anything. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Use guard(mutex) to simplify SNP AP Creation error handlingSean Christopherson
Use guard(mutex) in sev_snp_ap_creation() and modify the error paths to return directly instead of jumping to a common exit point. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Simplify request+kick logic in SNP AP Creation handlingSean Christopherson
Drop the local "kick" variable and the unnecessary "fallthrough" logic from sev_snp_ap_creation(), and simply pivot on the request when deciding whether or not to immediate force a state update on the target vCPU. No functional change intended. Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Require AP's "requested" SEV_FEATURES to match KVM's viewSean Christopherson
When handling an "AP Create" event, return an error if the "requested" SEV features for the vCPU don't exactly match KVM's view of the VM-scoped features. There is no known use case for heterogeneous SEV features across vCPUs, and while KVM can't actually enforce an exact match since the value in RAX isn't guaranteed to match what the guest shoved into the VMSA, KVM can at least avoid knowingly letting the guest run in an unsupported state. E.g. if a VM is created with DebugSwap disabled, KVM will intercept #DBs and DRs for all vCPUs, even if an AP is "created" with DebugSwap enabled in its VMSA. Note, the GHCB spec only "requires" that "AP use the same interrupt injection mechanism as the BSP", but given the disaster that is DebugSwap and SEV_FEATURES in general, it's safe to say that AMD didn't consider all possible complications with mismatching features between the BSP and APs. Opportunistically fold the check into the relevant request flavors; the "request < AP_DESTROY" check is just a bizarre way of implementing the AP_CREATE_ON_INIT => AP_CREATE fallthrough. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Don't change target vCPU state on AP Creation VMGEXIT errorSean Christopherson
If KVM rejects an AP Creation event, leave the target vCPU state as-is. Nothing in the GHCB suggests the hypervisor is *allowed* to muck with vCPU state on failure, let alone required to do so. Furthermore, kicking only in the !ON_INIT case leads to divergent behavior, and even the "kick" case is non-deterministic. E.g. if an ON_INIT request fails, the guest can successfully retry if the fixed AP Creation request is made prior to sending INIT. And if a !ON_INIT fails, the guest can successfully retry if the fixed AP Creation request is handled before the target vCPU processes KVM's KVM_REQ_UPDATE_PROTECTED_GUEST_STATE. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Refuse to attempt VRMUN if an SEV-ES+ guest has an invalid VMSASean Christopherson
Explicitly reject KVM_RUN with KVM_EXIT_FAIL_ENTRY if userspace "coerces" KVM into running an SEV-ES+ guest with an invalid VMSA, e.g. by modifying a vCPU's mp_state to be RUNNABLE after an SNP vCPU has undergone a Destroy event. On Destroy or failed Create, KVM marks the vCPU HALTED so that *KVM* doesn't run the vCPU, but nothing prevents a misbehaving VMM from manually making the vCPU RUNNABLE via KVM_SET_MP_STATE. Attempting VMRUN with an invalid VMSA should be harmless, but knowingly executing VMRUN with bad control state is at best dodgy. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3Sean Christopherson
Never rely on the CPU to restore/load host DR0..DR3 values, even if the CPU supports DebugSwap, as there are no guarantees that SNP guests will actually enable DebugSwap on APs. E.g. if KVM were to rely on the CPU to load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would run with clobbered-to-zero DRs if an SNP guest created APs without DebugSwap enabled. Update the comment to explain the dangers, and hopefully prevent breaking KVM in the future. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03KVM: SVM: Save host DR masks on CPUs with DebugSwapSean Christopherson
When running SEV-SNP guests on a CPU that supports DebugSwap, always save the host's DR0..DR3 mask MSR values irrespective of whether or not DebugSwap is enabled, to ensure the host values aren't clobbered by the CPU. And for now, also save DR0..DR3, even though doing so isn't necessary (see below). SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest can inform the hypervisor, i.e. KVM, of its "requested" features, but on CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from lying about which SEV features are being enabled (or not!). If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the MSRs with '0' if KVM doesn't save its desired value. Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM restores all DRs in common x86 code as needed via hw_breakpoint_restore(). I.e. there is no risk of host DR0..DR3 being clobbered (when it matters). However, there is a flaw in the opposite direction; because the guest can lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing that wart, as it's more of a documentation issue than a bug in the code. Note, KVM added support for DebugSwap on commit d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes, as the underlying flaw exists in hardware, not in KVM. I.e. all kernels that support SEV-SNP need to be patched, not just kernels with KVM's full support for DebugSwap (ignoring that DebugSwap support landed first). Opportunistically fix an incorrect statement in the comment; on CPUs without DebugSwap, the CPU does NOT save or load debug registers, i.e. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Cc: Naveen N Rao <naveen@kernel.org> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-01kvm: retry nx_huge_page_recovery_thread creationKeith Busch
A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while its trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-01vhost: return task creation error instead of NULLKeith Busch
Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-28KVM: x86: Always set mp_state to RUNNABLE on wakeup from HLTSean Christopherson
When emulating HLT and a wake event is already pending, explicitly mark the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU is already in the appropriate state. Barring a KVM bug, it should be impossible for the vCPU to be in a non-RUNNABLE state, but there is no advantage to relying on that to hold true, and ensuring the vCPU is made RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted. E.g. if the vCPU is not already RUNNABLE, then depending on when pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE state (set before __kvm_emulate_halt()), or transition the vCPU to HALTED and then RUNNABLE (pv_unhalted set after the kvm_vcpu_has_events() check). Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQsSean Christopherson
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot. Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabledSean Christopherson
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's value on #VMEXIT) if it diverges from the host's value and LBR virtualization is disabled, as hardware only context switches DEBUGCTL if LBR virtualization is fully enabled. Running the guest with the host's value has likely been mildly problematic for quite some time, e.g. it will result in undesirable behavior if BTF diverges (with the caveat that KVM now suppresses guest BTF due to lack of support). But the bug became fatal with the introduction of Bus Lock Trap ("Detect" in kernel paralance) support for AMD (commit 408eb7417a92 ("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will trigger an unexpected #DB. Note, suppressing the bus lock #DB, i.e. simply resuming the guest without injecting a #DB, is not an option. It wouldn't address the general issue with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible side effects if BusLockTrap is left enabled. If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to clear it by software are ignored. But if BusLockTrap is enabled, software can clear DR6.BLD: Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2) to 1. When bus lock trap is enabled, ... The processor indicates that this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11] previously had been defined to be always 1. and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by other #DBs: All other #DB exceptions leave DR6[BLD] unmodified E.g. leaving BusLockTrap enable can confuse a legacy guest that writes '0' to reset DR6. Reported-by: rangemachine@gmail.com Reported-by: whanos@sergal.fun Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787 Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86: Snapshot the host's DEBUGCTL in common x86Sean Christopherson
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Suppress DEBUGCTL.BTF on AMDSean Christopherson
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support BTF, and fully enabling BTF virtualization is non-trivial due to interactions with the emulator, guest_debug, #DB interception, nested SVM, etc. Don't inject #GP if the guest attempts to set BTF, as there's no way to communicate lack of support to the guest, and instead suppress the flag and treat the WRMSR as (partially) unsupported. In short, make KVM behave the same on AMD and Intel (VMX already squashes BTF). Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF has "worked" in any capacity is if the guest simultaneously enables LBRs. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective valueSean Christopherson
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed the architectural behavior of the bits and broke backwards compatibility. On CPUs without BusLockTrap (or at least, in APMs from before ~2023), bits 5:2 controlled the behavior of external pins: Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write. Software uses thesebits to control the type of information reported by the four external performance-monitoring/breakpoint pins on the processor. When a PBi bit is cleared to 0, the corresponding external pin (BPi) reports performance-monitor information. When a PBi bit is set to 1, the corresponding external pin (BPi) reports breakpoint information. With the introduction of BusLockTrap, presumably to be compatible with Intel CPUs, AMD redefined bit 2 to be BLCKDB: Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to enable generation of a #DB trap following successful execution of a bus lock when CPL is > 0. and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ". Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a feature cleanup to avoid breaking existing guest in LTS kernels. For now, drop the bits to retain backwards compatibility (of a sort). Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits is visible to the guest, whereas now the guest will always see '0'. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Inject #GP if memory operand for INVPCID is non-canonicalSean Christopherson
Inject a #GP if the memory operand received by INVCPID is non-canonical. The APM clearly states that the intercept takes priority over all #GP checks except the CPL0 restriction. Of course, that begs the question of how the CPU generates a linear address in the first place. Tracing confirms that EXITINFO1 does hold a linear address, at least for 64-bit mode guests (hooray GS prefix). Unfortunately, the APM says absolutely nothing about the EXITINFO fields for INVPCID intercepts, so it's not at all clear what's supposed to happen. Add a FIXME to call out that KVM still does the wrong thing for 32-bit guests, and if the stack segment is used for the memory operand. Cc: Babu Moger <babu.moger@amd.com> Cc: Jim Mattson <jmattson@google.com> Fixes: 4407a797e941 ("KVM: SVM: Enable INVPCID feature on AMD") Link: https://lore.kernel.org/r/20250224174522.2363400-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: VMX: Reject KVM_RUN if userspace forces emulation during nested VM-EnterSean Christopherson
Extend KVM's restrictions on userspace forcing "emulation required" at odd times to cover stuffing invalid guest state while a nested run is pending. Clobbering guest state while KVM is in the middle of emulating VM-Enter is nonsensical, as it puts the vCPU into an architecturally impossible state, and will trip KVM's sanity check that guards against KVM bugs, e.g. helps detect missed VMX consistency checks. WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline] WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637 Modules linked in: CPU: 3 UID: 0 PID: 6336 Comm: syz.0.73 Not tainted 6.13.0-rc1-syzkaller-00316-gb5f217084ab3 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline] RIP: 0010:vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637 <TASK> vcpu_enter_guest arch/x86/kvm/x86.c:11081 [inline] vcpu_run+0x3047/0x4f50 arch/x86/kvm/x86.c:11242 kvm_arch_vcpu_ioctl_run+0x44a/0x1740 arch/x86/kvm/x86.c:11560 kvm_vcpu_ioctl+0x6ce/0x1520 virt/kvm/kvm_main.c:4340 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x190/0x200 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Reported-by: syzbot+ac0bc3a70282b4d586cc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67598fb9.050a0220.17f54a.003b.GAE@google.com Debugged-by: James Houghton <jthoughton@google.com> Tested-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250224171409.2348647-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadowSean Christopherson
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's intr_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT. The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible. However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM. Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exception in the core VM-Enter/VM-Exit path are fatal. Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state. Oppurtunstically document why KVM needs to do STI in the first place. Reported-by: Doug Covelli <doug.covelli@broadcom.com> Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: x86/tdp_mmu: Remove tdp_mmu_for_each_pte()Nikolay Borisov
That macro acts as a different name for for_each_tdp_pte, apart from adding cognitive load it doesn't bring any value. Let's remove it. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250226074131.312565-1-nik.borisov@suse.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28KVM: nVMX: Decouple EPT RWX bits from EPT Violation protection bitsSean Christopherson
Define independent macros for the RWX protection bits that are enumerated via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in EPT entries via compile-time asserts. Piggybacking the EPTE defines works for now, but it creates holes in the EPT_VIOLATION_xxx macros and will cause headaches if/when KVM emulates Mode-Based Execution (MBEC), or any other features that introduces additional protection information. Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK so that it doesn't become stale if/when MBEC support is added. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-27KVM: nVMX: Always use IBPB to properly virtualize IBRSYosry Ahmed
On synthesized nested VM-exits in VMX, an IBPB is performed if IBRS is advertised to the guest to properly provide separate prediction domains for L1 and L2. However, this is currently conditional on X86_FEATURE_USE_IBPB, which depends on the host spectre_v2_user mitigation. In short, if spectre_v2_user=no, IBRS is not virtualized correctly and L1 becomes susceptible to attacks from L2. Fix this by performing the IBPB regardless of X86_FEATURE_USE_IBPB. Fixes: 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS") Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-6-yosry.ahmed@linux.dev
2025-02-27x86/bugs: Use a static branch to guard IBPB on vCPU switchYosry Ahmed
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM when a new vCPU is loaded, introduce a static branch, similar to switch_mm_*_ibpb. This makes it obvious in spectre_v2_user_select_mitigation() what exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB (which will be shortly removed). It also provides more fine-grained control, making it simpler to change/add paths that control the IBPB in the vCPU switch path without affecting other IBPBs. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
2025-02-27x86/bugs: Move the X86_FEATURE_USE_IBPB check into callersYosry Ahmed
indirect_branch_prediction_barrier() only performs the MSR write if X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In preparation for removing X86_FEATURE_USE_IBPB, move the feature check into the callers so that they can be addressed one-by-one, and use X86_FEATURE_IBPB instead to guard the MSR write. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
2025-02-26KVM: Drop kvm_arch_sync_events() now that all implementations are nopsSean Christopherson
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other arch has ever used it). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20250224235542.2562848-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Fold guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm()Sean Christopherson
Fold the guts of kvm_arch_sync_events() into kvm_arch_pre_destroy_vm(), as the kvmclock and PIT background workers only need to be stopped before destroying vCPUs (to avoid accessing vCPUs as they are being freed); it's a-ok for them to be running while the VM is visible on the global vm_list. Note, the PIT also needs to be stopped before IRQ routing is freed (because KVM's IRQ routing is garbage and assumes there is always non-NULL routing). Opportunistically add comments to explain why KVM stops/frees certain assets early. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Unload MMUs during vCPU destruction, not beforeSean Christopherson
When destroying a VM, unload a vCPU's MMUs as part of normal vCPU freeing, instead of as a separate prepratory action. Unloading MMUs ahead of time is a holdover from commit 7b53aa565084 ("KVM: Fix vcpu freeing for guest smp"), which "fixed" a rather egregious flaw where KVM would attempt to free *all* MMU pages when destroying a vCPU. At the time, KVM would spin on all MMU pages in a VM when free a single vCPU, and so would hang due to the way KVM pins and zaps root pages (roots are invalidated but not freed if they are pinned by a vCPU). static void free_mmu_pages(struct kvm_vcpu *vcpu) { struct kvm_mmu_page *page; while (!list_empty(&vcpu->kvm->active_mmu_pages)) { page = container_of(vcpu->kvm->active_mmu_pages.next, struct kvm_mmu_page, link); kvm_mmu_zap_page(vcpu->kvm, page); } free_page((unsigned long)vcpu->mmu.pae_root); } Now that KVM doesn't try to free all MMU pages when destroying a single vCPU, there's no need to unpin roots prior to destroying a vCPU. Note! While KVM mostly destroys all MMUs before calling kvm_arch_destroy_vm() (see commit f00be0cae4e6 ("KVM: MMU: do not free active mmu pages in free_mmu_pages()")), unpinning MMU roots during vCPU destruction will unfortunately trigger remote TLB flushes, i.e. will try to send requests to all vCPUs. Happily, thanks to commit 27592ae8dbe4 ("KVM: Move wiping of the kvm->vcpus array to common code"), that's a non-issue as freed vCPUs are naturally skipped by xa_for_each_range(), i.e. by kvm_for_each_vcpu(). Prior to that commit, KVM x86 rather stupidly freed vCPUs one-by-one, and _then_ nullified them, one-by-one. I.e. triggering a VM-wide request would hit a use-after-free. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: x86: Don't load/put vCPU when unloading its MMU during teardownSean Christopherson
Don't load (and then put) a vCPU when unloading its MMU during VM destruction, as nothing in kvm_mmu_unload() accesses vCPU state beyond the root page/address of each MMU, i.e. can't possible need to run with the vCPU loaded. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250224235542.2562848-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26x86/bugs: KVM: Add support for SRSO_MSR_FIXBorislav Petkov
Add support for CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate SRSO. Enable BpSpecReduce to mitigate SRSO across guest/host boundaries. Switch back to enabling the bit when virtualization is enabled and to clear the bit when virtualization is disabled because using a MSR slot would clear the bit when the guest is exited and any training the guest has done, would potentially influence the host kernel when execution enters the kernel and hasn't VMRUN the guest yet. More detail on the public thread in Link below. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org