path: root/arch/x86
Age  Commit message  Author
2025-06-20  KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels  (Sean Christopherson)
When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for 64-bit kernels instead of arbitrarily working on 4-byte chunks. Opportunistically rename local variables in nested_svm_merge_msrpm() to more precisely/accurately reflect their purpose ("offset" in particular is extremely ambiguous). Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
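A minimal sketch of the chunked merge described above, assuming illustrative names and a plain in-memory copy of L1's bitmap (the real code reads vmcb12's bitmap via uaccess):

  /* Merge two MSR permission maps word-by-word.  A set bit means
   * "intercept", so the result intercepts an MSR if either L0 or L1
   * intercepts it.  On a 64-bit kernel each iteration is an 8-byte access. */
  static void merge_msrpm(unsigned long *dst, const unsigned long *l0,
                          const unsigned long *l1, size_t bytes)
  {
          size_t i;

          for (i = 0; i < bytes / sizeof(unsigned long); i++)
                  dst[i] = l0[i] | l1[i];
  }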
2025-06-20  KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR  (Sean Christopherson)
Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone. Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps  (Sean Christopherson)
Access the MSRPM using u32/4-byte chunks (and appropriately adjusted offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN. The only reason to batch accesses to MSRPMs is to avoid the overhead of uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's bitmap pointed at by vmcb12. For all other uses, either per-bit accesses are more than fast enough (no uaccess), or KVM is only accessing a single bit (nested_svm_exit_handled_msr()) and so there's nothing to batch. In addition to (hopefully) documenting the uniqueness of the merging code, restricting chunked access to _just_ the merging code will allow for increasing the chunk size (to unsigned long) with minimal risk. Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"  (Sean Christopherson)
Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type. Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent. Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Move svm_msrpm_offset() to nested.c  (Sean Christopherson)
Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses  (Sean Christopherson)
Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index. Opportunistically clean up the helper's formatting. Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Merge "after set CPUID" intercept recalc helpersSean Christopherson
Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller  (Sean Christopherson)
Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past). No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific  (Sean Christopherson)
Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()  (Sean Christopherson)
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change  (Sean Christopherson)
On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
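A rough sketch of the filter-agnostic approach, assuming the SVM per-MSR helper described further down in this log ("Implement and adopt VMX style MSR intercepts APIs"); the MSR list and names are illustrative, not the actual set:

  /* On a userspace MSR filter change, recompute every intercept from
   * scratch; the disable helper is assumed to consult the current
   * userspace MSR filter before actually passing an MSR through. */
  static void recalc_msr_intercepts(struct kvm_vcpu *vcpu)
  {
          static const u32 passthrough_msrs[] = {
                  MSR_STAR, MSR_IA32_SYSENTER_CS,
                  MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
          };
          int i;

          for (i = 0; i < ARRAY_SIZE(passthrough_msrs); i++)
                  svm_disable_intercept_for_msr(vcpu, passthrough_msrs[i],
                                                MSR_TYPE_RW);
  }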
2025-06-20  KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change  (Sean Christopherson)
On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Borislav Petkov <bp@alien8.de> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Move definition of X2APIC_MSR() to lapic.h  (Sean Christopherson)
Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop "always" flag from list of possible passthrough MSRsSean Christopherson
Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely. Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest  (Sean Christopherson)
Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs says that accesses to the GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended. Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Implement and adopt VMX style MSR intercepts APIs  (Sean Christopherson)
Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of: vmx_disable_intercept_for_msr(vcpu, msr, type) vmx_enable_intercept_for_msr(vcpu, msr, type) vmx_set_intercept_for_msr(vcpu, msr, type, intercept) to eventually replace SVM's single helper: set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write) which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be *set*. Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware. No functional change intended. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
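One plausible shape for the three-way helper, mirroring how VMX structures vmx_set_intercept_for_msr(); the SVM names come from the changelog's description and the body is a sketch, not the actual source:

  static inline void svm_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
                                               int type, bool intercept)
  {
          if (intercept)
                  svm_enable_intercept_for_msr(vcpu, msr, type);
          else
                  svm_disable_intercept_for_msr(vcpu, msr, type);
  }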
2025-06-20  KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets  (Sean Christopherson)
Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage. Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious. Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future). Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
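A rough illustration of bit-based access (the three ranges and the 2-bits-per-MSR layout come from the APM's MSRPM definition; the helper bodies are a sketch, not the kernel's code):

  /* Map an MSR index to its read-intercept bit number within the MSRPM;
   * the write-intercept bit is the next bit up.  Each range covers
   * 8192 MSRs * 2 bits = 2048 bytes of the bitmap. */
  static int svm_msrpm_bit_nr(u32 msr)
  {
          if (msr <= 0x1fff)
                  return msr * 2;
          if (msr >= 0xc0000000 && msr <= 0xc0001fff)
                  return 0x800 * BITS_PER_BYTE + (msr - 0xc0000000) * 2;
          if (msr >= 0xc0010000 && msr <= 0xc0011fff)
                  return 0x1000 * BITS_PER_BYTE + (msr - 0xc0010000) * 2;
          return -EINVAL;
  }

  /* Example: is a write to @msr intercepted?  Out-of-range MSRs report as
   * intercepted, matching hardware behavior for unmapped MSRs. */
  static bool msr_write_intercepted(unsigned long *msrpm, u32 msr)
  {
          int bit = svm_msrpm_bit_nr(msr);

          return bit < 0 ? true : test_bit(bit + 1, msrpm);
  }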
2025-06-20  KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"  (Sean Christopherson)
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm(). Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2. Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge  (Sean Christopherson)
Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES, and likely never will. Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps  (Sean Christopherson)
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes. Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime. Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible. Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Clean up macros related to architectural MSRPM definitions  (Sean Christopherson)
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros. Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM". Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs  (Sean Christopherson)
Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts  (Sean Christopherson)
Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
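Illustrative only, assuming a per-vCPU bitmap that is never touched without vcpu->mutex held:

  static void mark_msr_passthrough(unsigned long *shadow, int slot)
  {
          /* Non-atomic variant: no LOCK-prefixed read-modify-write is
           * needed because nothing else can access this vCPU's bitmap
           * concurrently. */
          __set_bit(slot, shadow);
  }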
2025-06-20  KVM: SVM: Kill the VM instead of the host if MSR interception is buggy  (Sean Christopherson)
WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host. Opportunistically move the sanity checks above their use to index into the MSRPM, so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
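A sketch of the pattern, borrowing the bit-number helper named elsewhere in this log purely for illustration: the range check fires before the index is used, so a bug yields a WARN plus a dead VM instead of an out-of-bounds load.

  static void svm_set_msr_read_intercept(struct kvm_vcpu *vcpu, u32 msr)
  {
          int bit = svm_msrpm_bit_nr(msr);        /* assumed helper */

          if (KVM_BUG_ON(bit < 0, vcpu->kvm))
                  return;

          __set_bit(bit, (unsigned long *)to_svm(vcpu)->msrpm);
  }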
2025-06-20  KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs  (Sean Christopherson)
Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs. To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs). Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Tag MSR bitmap initialization helpers with __init  (Sean Christopherson)
Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails  (Sean Christopherson)
WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable since it is trivially easy to propagate the error up the stack. Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()  (Sean Christopherson)
Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on. No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails). Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest  (Sean Christopherson)
Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR. Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL") Reported-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest  (Maxim Levitsky)
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software, and instead transitions directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI. Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest. Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi. Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit. Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag). Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
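A minimal sketch of the propagation, assuming the host_debugctl snapshot mentioned in the TDX entry further down and an illustrative helper name:

  /* Fold the host's FREEZE_IN_SMM choice into the DEBUGCTL value that will
   * be loaded for the guest, without letting the guest see or control it. */
  static u64 debugctl_for_guest_entry(u64 guest_debugctl, u64 host_debugctl)
  {
          guest_debugctl &= ~DEBUGCTLMSR_FREEZE_IN_SMM;
          guest_debugctl |= host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM;

          return guest_debugctl;
  }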
2025-06-20  KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs  (Maxim Levitsky)
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: massage changelog, make inline, use in all prepare_vmcs02() cases] Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter  (Maxim Levitsky)
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports a subset of hardware functionality, i.e. KVM can't rely on hardware to detect illegal/unsupported values. Failure to check the vmcs12 value would allow the guest to load any hardware-supported value while running L2. Take care to exempt BTF and LBR from the validity check in order to match KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR are being intercepted. Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set *and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but that would incur non-trivial complexity and wouldn't change the fact that KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity is not worth carrying. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
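A sketch of the consistency check, assuming the standalone validity helper extracted in the entry below (name assumed); BTF and LBR are masked off so the check matches KVM's quirky WRMSR handling:

  static bool nested_vmx_valid_debugctl(struct kvm_vcpu *vcpu, u64 debugctl)
  {
          /* Exempt BTF and LBR, mirroring KVM's handling of guest WRMSR. */
          debugctl &= ~(DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR);

          return vmx_is_valid_debugctl(vcpu, debugctl);   /* assumed helper */
  }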
2025-06-20  KVM: VMX: Extract checking of guest's DEBUGCTL into helper  (Sean Christopherson)
Move VMX's logic to check DEBUGCTL values into a standalone helper so that the code can be used by nested VM-Enter to apply the same logic to the value being loaded from vmcs12. KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested VM-Enter, as hardware may support features that KVM does not, i.e. relying on hardware to detect invalid guest state will result in false negatives. Unfortunately, that means applying KVM's funky suppression of BTF and LBR to vmcs12 so as not to break existing guests. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported  (Sean Christopherson)
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the guest CPUID model, as debug support is supposed to be available if RTM is supported, and there are no known downsides to letting the guest debug RTM aborts. Note, there are no known bug reports related to RTM_DEBUG, the primary motivation is to reduce the probability of breaking existing guests when a future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL (KVM currently lets L2 run with whatever hardware supports; whoops). Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to DR7.RTM. Fixes: 83c529151ab0 ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag  (Sean Christopherson)
Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest. Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap  (Sean Christopherson)
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS. Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: TDX: Use kvm_arch_vcpu.host_debugctl to restore the host's DEBUGCTL  (Sean Christopherson)
Use the kvm_arch_vcpu.host_debugctl snapshot to restore DEBUGCTL after running a TD vCPU. The final TDX series rebase was mishandled, likely due to commit fb71c7959356 ("KVM: x86: Snapshot the host's DEBUGCTL in common x86") deleting the same line of code from vmx.h, i.e. creating a semantic conflict of sorts, but no syntactic conflict. Using the version in kvm_vcpu_arch picks up the ulong => u64 fix (which isn't relevant to TDX) as well as the IRQ fix from commit 189ecdb3e112 ("KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs"). Link: https://lore.kernel.org/all/20250307212053.2948340-10-pbonzini@redhat.com Cc: Adrian Hunter <adrian.hunter@intel.com> Fixes: 8af099037527 ("KVM: TDX: Save and restore IA32_DEBUGCTL") Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities  (Paolo Bonzini)
Allow userspace to advertise TDG.VP.VMCALL subfunctions that the kernel also supports. For each output register of GetTdVmCallInfo's leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported TDVMCALLs (userspace can set those blindly) and one for user-supported TDVMCALLs (userspace can set those if it knows how to handle them). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20  KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt  (Paolo Bonzini)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20  KVM: TDX: Exit to userspace for GetTdVmCallInfo  (Binbin Wu)
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs beyond the GHCI base API when r12 is 1. The GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by the VMM to support TDX guests. For GetTdVmCallInfo: - When the leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates that all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs. - When the leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly in the leaf outputs. KVM could set the bit(s) for TDVMCALLs it supports itself, i.e. those that don't need support from userspace, after returning from userspace and before entering the guest. Currently, no such TDVMCALLs are implemented; KVM just sets the values returned from userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20  x86/efi: Move runtime service initialization to arch/x86  (Alexander Shishkin)
The EFI call in start_kernel() is guarded by #ifdef CONFIG_X86. Move the thing to the arch_cpu_finalize_init() path on x86 and get rid of the #ifdef in start_kernel(). No functional change intended. Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/all/20250620135325.3300848-5-kirill.shutemov%40linux.intel.com
2025-06-20  KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>  (Binbin Wu)
Handle TDVMCALL for GetQuote to generate a TD-Quote. GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area. KVM only checks the GPA from the TDX guest has the shared-bit set and drops the shared-bit before exiting to userspace to avoid bleeding the shared-bit into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU) and userspace VMM queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully or not. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether the Quote generation is completed or not. When completed, the generated Quote is returned via the same buffer. Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20  KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs  (Binbin Wu)
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and return it for unimplemented TDVMCALL subfunctions. Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not implemented is vague because TDX guests can't tell whether the error is due to the subfunction not being supported or to an invalid input to the subfunction. The new GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the ambiguity; use it instead of TDVMCALL_STATUS_INVALID_OPERAND. Before the change, for common guest implementations, when a TDX guest receives TDVMCALL_STATUS_INVALID_OPERAND, there are two cases: 1. Some operand is invalid. The guest could change the operand to another value and retry. 2. The subfunction is not supported. For case 1, an invalid operand usually means a guest implementation bug. Since the TDX guest can't tell which case it is, the best practice for handling TDVMCALL_STATUS_INVALID_OPERAND is to stop calling such a leaf, treating the failure as fatal if the TDVMCALL is essential or ignoring it if the TDVMCALL is optional. With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to old TDX guests that do not know about it, but it is expected that the guest will take the same action as for TDVMCALL_STATUS_INVALID_OPERAND. Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND specifically; for example, Linux just checks for success. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20  Merge tag 'drm-misc-next-2025-06-19' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next  (Dave Airlie)
drm-misc-next for 6.17:

UAPI Changes:
- Add Task Information for the wedge API

Cross-subsystem Changes:

Core Changes:
- Fix warnings related to export.h
- fbdev: Make CONFIG_FIRMWARE_EDID available on all architectures
- fence: Fix UAF issues
- format-helper: Improve tests

Driver Changes:
- ivpu: Add turbo flag, Add Wildcat Lake Support
- rz-du: Improve MIPI-DSI Support
- vmwgfx: fence improvement

Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://lore.kernel.org/r/20250619-perfect-industrious-whippet-8ed3db@houat
2025-06-18  x86/sev: Drop unnecessary parameter in snp_issue_guest_request()  (Alexey Kardashevskiy)
Commit 3e385c0d6ce8 ("virt: sev-guest: Move SNP Guest Request data pages handling under snp_cmd_mutex") moved @input from snp_msg_desc to snp_guest_req which is passed to snp_issue_guest_request(). Drop the extra parameter. No functional change intended. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Dionna Glaze <dionnaglaze@google.com> Link: https://lore.kernel.org/20250611040842.2667262-5-aik@amd.com
2025-06-18  x86/sev: Document requirement for linear mapping of guest request buffers  (Alexey Kardashevskiy)
The Guest Request now supports three types of messages; the largest is the extended variant of MSG_REPORT_REQ: sizeof(snp_ext_report_req)==112. These used to be allocated on the stack and were then moved to the SNP guest platform device (snp_guest_dev) for the reason explained in db10cb9b5746 ("virt: sevguest: Fix passing a stack buffer as a scatterlist target"): aesgcm_encrypt() and aesgcm_decrypt() are used for guest messages and might potentially use a crypto accelerator which requires DMA buffers to be in the linear mapping. Add a comment, and warn and return an error when the buffers are not in the linear mapping. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Dionna Glaze <dionnaglaze@google.com> Link: https://lore.kernel.org/20250611040842.2667262-4-aik@amd.com
2025-06-18  x86/sev: Allocate request in TSC_INFO_REQ on stack  (Alexey Kardashevskiy)
Allocate the 88-byte request structure on the stack and skip the needless kzalloc/kfree. While at it, correct the indentation. No functional change intended. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Dionna Glaze <dionnaglaze@google.com> Link: https://lore.kernel.org/20250611040842.2667262-3-aik@amd.com
2025-06-18  virt: sev-guest: Contain snp_guest_request_ioctl in sev-guest  (Alexey Kardashevskiy)
SNP Guest Request uses only exitinfo2, which is a return value from the GHCB, has meaning beyond the ioctl, and therefore belongs in struct snp_guest_req. Move exitinfo2 there and remove snp_guest_request_ioctl from the SEV platform code. No functional change intended. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Dionna Glaze <dionnaglaze@google.com> Link: https://lore.kernel.org/20250611040842.2667262-2-aik@amd.com
2025-06-18  x86/alternatives: Fix int3 handling failure from broken text_poke array  (Masami Hiramatsu (Google))
Since smp_text_poke_single() does not expect that another text_poke request is already queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This causes an Oops in int3 because bsearch() fails:

  CPU 0                      CPU 1                   CPU 2
  -----                      -----                   -----
  smp_text_poke_batch_add()
                             smp_text_poke_single()  <<-- Adds out of order
                                                     <int3>
                                                     [Fails to find address in text_poke_array]
                                                     OOPS!

Or an unhandled page fault because of a buffer overflow:

  CPU 0                                   CPU 1
  -----                                   -----
  smp_text_poke_batch_add() <<+
  ...                         |
  smp_text_poke_batch_add() <<-- Adds TEXT_POKE_ARRAY_MAX times.
                                          smp_text_poke_single() {
                                            __smp_text_poke_batch_add() <<-- Adds entry at TEXT_POKE_ARRAY_MAX + 1
                                            smp_text_poke_batch_finish()
                                            [Unhandled page fault because text_poke_array.nr_entries is overwritten]
                                            BUG!
                                          }

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that the queue is correctly flushed if needed. Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg+4w@mail.gmail.com/ Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Link: https://lkml.kernel.org/r/175020512308.3582717.13631440385506146631.stgit@mhiramat.tok.corp.google.com
2025-06-17  x86/mm: Fix early boot use of INVLPGB  (Rik van Riel)
The INVLPGB instruction has limits on how many pages it can invalidate at once. That limit is enumerated in CPUID, read by the kernel, and stored in 'invlpgb_count_max'. Ranged invalidation, like invlpgb_kernel_range_flush(), breaks up its invalidations so that they do not exceed the limit. However, early boot code currently attempts to do ranged invalidation before populating 'invlpgb_count_max'. There is a for loop which is basically: for (...; addr < end; addr += invlpgb_count_max*PAGE_SIZE) If invlpgb_kernel_range_flush() is called before the kernel has read the value of invlpgb_count_max from the hardware, the normally bounded loop can become an infinite loop because invlpgb_count_max is initialized to zero. Fix that issue by initializing invlpgb_count_max to 1. This way INVLPGB at early boot time will be a little bit slower than normal (with an initialized invlpgb_count_max), and not an instant hang at bootup time. Fixes: b7aa05cbdc52 ("x86/mm: Add INVLPGB support code") Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20250606171112.4013261-3-riel%40surriel.com
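A simplified illustration of the failure mode and the fix (not the kernel's exact code):

  /* If invlpgb_count_max were still 0 this early in boot, the loop stride
   * below would be 0 and the loop would never terminate.  Starting at 1
   * keeps early flushes correct, just limited to one page per INVLPGB. */
  static unsigned long invlpgb_count_max = 1;

  static void invlpgb_kernel_range_flush(unsigned long addr, unsigned long end)
  {
          for (; addr < end; addr += invlpgb_count_max * PAGE_SIZE) {
                  /* issue INVLPGB covering up to invlpgb_count_max pages */
          }
  }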