summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2024-10-04KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)Sean Christopherson
[ Upstream commit 73b42dc69be8564d4951a14d00f827929fe5ef79 ] Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04KVM: x86: Make x2APIC ID 100% readonlySean Christopherson
[ Upstream commit 4b7c3f6d04bd53f2e5b228b6821fb8f5d1ba3071 ] Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race: static inline u32 kvm_apic_id(struct kvm_lapic *apic) { - return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff; + /* To avoid a race between apic_base and following APIC_ID update when + * switching to x2apic_mode, the x2apic mode returns initial x2apic id. + */ + if (apic_x2apic_mode(apic)) + return apic->vcpu->vcpu_id; + + return kvm_lapic_get_reg(apic, APIC_ID) >> 24; } Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM *does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID. Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID. WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm] CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014 RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm] Call Trace: <TASK> kvm_apic_set_state+0x1cf/0x5b0 [kvm] kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm] kvm_vcpu_ioctl+0x663/0x8a0 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fade8b9dd6f Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely. Reported-by: Michal Luczaj <mhal@rbox.co> Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co Reported-by: Haoyu Wu <haoyuwu254@gmail.com> Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240802202941.344889-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 73b42dc69be8 ("KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definitionHou Wenlong
[ Upstream commit c7d4c5f01961cdc4f1d29525e2b0d71f62c5bc33 ] The check_apicv_inhibit_reasons() callback implementation was dropped in the commit b3f257a84696 ("KVM: x86: Track required APICv inhibits with variable, not callback"), but the definition removal was missed in the final version patch (it was removed in the v4). Therefore, it should be dropped, and the vmx_check_apicv_inhibit_reasons() function declaration should also be removed. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 73b42dc69be8 ("KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-04KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode()Sean Christopherson
commit d33234342f8b468e719e05649fd26549fb37ef8a upstream. Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-04KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bitsSean Christopherson
commit 71bf395a276f0578d19e0ae137a7d1d816d08e0e upstream. Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that are must-be-zero on both Intel and AMD, i.e. any reserved bits other than the BUSY bit, which Intel ignores and basically says is undefined. KVM's xapic_state_test selftest has been fudging the bug since commit 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode"), which essentially removed the testcase instead of fixing the bug. WARN if the nodecode path triggers a #GP, as the CPU is supposed to check reserved bits for ICR when it's partially virtualized. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missingRavi Bangoria
commit 54950bfe2b69cdc06ef753872b5225e54eb73506 upstream. If host supports Bus Lock Detect, KVM advertises it to guests even if SVM support is absent. Additionally, guest wouldn't be able to use it despite guest CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM cpu capability. Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: 76ea438b4afc ("KVM: X86: Expose bus lock debug exception to guest") Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASEMaxim Levitsky
commit dad1613e0533b380318281c1519e1a3477c2d0d2 upstream. If these msrs are read by the emulator (e.g due to 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases, and return them to the emulator. Fix that. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTSSean Christopherson
commit 4bcdd831d9d01e0fb64faea50732b59b2ee88da1 upstream. Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()Sean Christopherson
commit 321ef62b0c5f6f57bb8500a2ca5986052675abbf upstream. Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: nVMX: Check for pending posted interrupts when looking for nested eventsSean Christopherson
commit 27c4fa42b11af780d49ce704f7fa67b3c2544df4 upstream. Check for pending (and notified!) posted interrupts when checking if L2 has a pending wake event, as fully posted/notified virtual interrupt is a valid wake event for HLT. Note that KVM must check vmx->nested.pi_pending to avoid prematurely waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.0N=1, the virtual interrupt won't actually be recognized until a notification IRQ is received by the vCPU or the vCPU does (nested) VM-Enter. Fixes: 26844fee6ade ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: nVMX: Request immediate exit iff pending nested event needs injectionSean Christopherson
commit 32f55e475ce2c4b8b124d335fcfaf1152ba977a1 upstream. When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vectorSean Christopherson
commit d83c36d822be44db4bad0c43bea99c8908f54117 upstream. Add a helper to retrieve the highest pending vector given a Posted Interrupt descriptor. While the actual operation is straightforward, it's surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's find_highest_vector(), which doesn't work with PID.PIR due to the APIC's IRR and ISR component registers being physically discontiguous (they're 4-byte registers aligned at 16-byte intervals). To make PIR handling more consistent with respect to IRR and ISR handling, return -1 to indicate "no interrupt pending". Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked()Sean Christopherson
commit 322a569c4b4188a0da2812f9e952780ce09b74ba upstream. Move the non-VMX chunk of the "interrupt blocked" checks to a separate helper so that KVM can reuse the code to detect if interrupts are blocked for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake event. If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2 HLTs, then L2 virtual interrupts are valid wake events, but if and only if interrupts are unblocked for L2. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-21KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guestsMichael Roth
With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption"), older VMMs like QEMU 9.0 and older will fail when booting SEV-ES guests with something like the following error: qemu-system-x86_64: error: failed to get MSR 0x174 qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. This is because older VMMs that might still call svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if those interfaces were essentially just noops because of the vCPU state being encrypted and stored separately in the VMSA. Now those VMMs will get an -EINVAL and generally crash. Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of the stricter limitations of what vCPU state can be sync'd during guest run-time, so newer QEMU for instance will work both for legacy KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2. So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with -EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing with either an older VMM and so it needs to assume that returning -EINVAL might break the VMM. Address this by only returning -EINVAL if the guest was started with KVM_SEV_INIT2. Otherwise, just silently return. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Nikunj A Dadhania <nikunj@amd.com> Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/ Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption") Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240604233510.764949-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-20KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routesSean Christopherson
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-05KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attrTao Su
Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn(). Before checking the mismatch of private vs. shared, mmu_invalidate_seq is saved to fault->mmu_seq, which can be used to detect an invalidation related to the gfn occurred, i.e. KVM will not install a mapping in page table if fault->mmu_seq != mmu_invalidate_seq. Currently there is a second snapshot of mmu_invalidate_seq, which may not be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute may be changed between the two snapshots, but the gfn may be mapped in page table without hindrance. Therefore, drop the second snapshot as it has no obvious benefits. Fixes: f6adeae81f35 ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()") Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03Merge branch 'kvm-fixes-6.10-1' into HEADPaolo Bonzini
* Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.
2024-06-03KVM: x86: Drop support for hand tuning APIC timer advancement from userspaceSean Christopherson
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Delegate LBR virtualization to the processorRavi Bangoria
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. Although KVM currently enforces LBRV for SEV-ES guests, there are multiple issues with it: o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR interception is used to dynamically toggle LBRV for performance reasons, this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3: [guest ~]# wrmsr 0x1d9 0x4 KVM: entry failed, hardware error 0xffffffff EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000 Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests. No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR is of swap type A. o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA encryption. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Co-developed-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Disallow SEV-ES guests when X86_FEATURE_LBRV is absentRavi Bangoria
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES guests. So, prevent SEV-ES guests when LBRV support is missing. [1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023, Vol 2, 15.35.2 Enabling SEV-ES. https://bugzilla.kernel.org/attachment.cgi?id=304653 Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-03KVM: SEV-ES: Prevent MSR access post VMSA encryptionNikunj A Dadhania
KVM currently allows userspace to read/write MSRs even after the VMSA is encrypted. This can cause unintentional issues if MSR access has side- effects. For ex, while migrating a guest, userspace could attempt to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on the target. Fix this by preventing access to those MSRs which are context switched via the VMSA, once the VMSA is encrypted. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Message-ID: <20240531044644.768-2-ravi.bangoria@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: SVM: WARN on vNMI + NMI window iff NMIs are outright maskedSean Christopherson
When requesting an NMI window, WARN on vNMI support being enabled if and only if NMIs are actually masked, i.e. if the vCPU is already handling an NMI. KVM's ABI for NMIs that arrive simultanesouly (from KVM's point of view) is to inject one NMI and pend the other. When using vNMI, KVM pends the second NMI simply by setting V_NMI_PENDING, and lets the CPU do the rest (hardware automatically sets V_NMI_BLOCKING when an NMI is injected). However, if KVM can't immediately inject an NMI, e.g. because the vCPU is in an STI shadow or is running with GIF=0, then KVM will request an NMI window and trigger the WARN (but still function correctly). Whether or not the GIF=0 case makes sense is debatable, as the intent of KVM's behavior is to provide functionality that is as close to real hardware as possible. E.g. if two NMIs are sent in quick succession, the probability of both NMIs arriving in an STI shadow is infinitesimally low on real hardware, but significantly larger in a virtual environment, e.g. if the vCPU is preempted in the STI shadow. For GIF=0, the argument isn't as clear cut, because the window where two NMIs can collide is much larger in bare metal (though still small). That said, KVM should not have divergent behavior for the GIF=0 case based on whether or not vNMI support is enabled. And KVM has allowed simultaneous NMIs with GIF=0 for over a decade, since commit 7460fb4a3400 ("KVM: Fix simultaneous NMIs"). I.e. KVM's GIF=0 handling shouldn't be modified without a *really* good reason to do so, and if KVM's behavior were to be modified, it should be done irrespective of vNMI support. Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI") Cc: stable@vger.kernel.org Cc: Santosh Shukla <Santosh.Shukla@amd.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522021435.1684366-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86: Force KVM_WERROR if the global WERROR is enabledSean Christopherson
Force KVM_WERROR if the global WERROR is enabled to avoid pestering the user about a Kconfig that will ultimately be ignored. Force KVM_WERROR instead of making it mutually exclusive with WERROR to avoid generating a .config builds KVM with -Werror, but has KVM_WERROR=n. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240517180341.974251-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86: Disable KVM_INTEL_PROVE_VE by defaultSean Christopherson
Disable KVM's "prove #VE" support by default, as it provides no functional value, and even its sanity checking benefits are relatively limited. I.e. it should be fully opt-in even on debug kernels, especially since EPT Violation #VE suppression appears to be buggy on some CPUs. Opportunistically add a line in the help text to make it abundantly clear that KVM_INTEL_PROVE_VE should never be enabled in a production environment. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Print SPTEs on unexpected #VESean Christopherson
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has *really* gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Dump VMCS on unexpected #VESean Christopherson
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible to figure out why the #VE occurred. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Add sanity checks that KVM doesn't create EPT #VE SPTEsSean Christopherson
Assert that KVM doesn't set a SPTE to a value that could trigger an EPT Violation #VE on a non-MMIO SPTE, e.g. to help detect bugs even without KVM_INTEL_PROVE_VE enabled, and to help debug actual #VE failures. Note, this will run afoul of TDX support, which needs to reflect emulated MMIO accesses into the guest as #VEs (which was the whole point of adding EPT Violation #VE support in KVM). The obvious fix for that is to exempt MMIO SPTEs, but that's annoyingly difficult now that is_mmio_spte() relies on a per-VM value. However, resolving that conundrum is a future problem, whereas getting KVM_INTEL_PROVE_VE healthy is a current problem. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Always handle #VEs in L0 (never forward #VEs from L2 to L1)Sean Christopherson
Always handle #VEs, e.g. due to prove EPT Violation #VE failures, in L0, as KVM does not expose any #VE capabilities to L1, i.e. any and all #VEs are KVM's responsibility. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Initialize #VE info page for vmcs02 when proving #VE supportSean Christopherson
Point vmcs02.VE_INFORMATION_ADDRESS at the vCPU's #VE info page when initializing vmcs02, otherwise KVM will run L2 with EPT Violation #VE enabled and a VE info address pointing at pfn 0. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Don't kill the VM on an unexpected #VESean Christopherson
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely the #VE is fatal to the guest, and even less likely that it presents a danger to the host. Simply resume the guest on "failure", as the #VE info page's BUSY field will prevent converting any more EPT Violations to #VEs for the vCPU (at least, that's what the BUSY field is supposed to do). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Use SHADOW_NONPRESENT_VALUE for atomic zap in TDP MMUIsaku Yamahata
Use SHADOW_NONPRESENT_VALUE when zapping TDP MMU SPTEs with mmu_lock held for read, tdp_mmu_zap_spte_atomic() was simply missed during the initial development. Fixes: 7f01cab84928 ("KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE") Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [sean: write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240518000430.1118488-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
2024-05-12Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.10: - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is unused by hardware, so that KVM can communicate its inability to map GPAs that set bits 51:48 due to lack of 5-level paging. Guest firmware is expected to use the information to safely remap BARs in the uppermost GPA space, i.e to avoid placing a BAR at a legal, but unmappable, GPA. - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Don't completely ignore same-value writes to immutable feature MSRs, as doing so results in KVM failing to reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled KVM-wide to avoid confusing debuggers (KVM will never bother clearing the ABSENT inhibit, even if userspace enables in-kernel local APIC).
2024-05-12Merge tag 'kvm-x86-mmu-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.10: - Process TDP MMU SPTEs that are are zapped while holding mmu_lock for read after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows vCPU tasks to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Fix a longstanding, likely benign-in-practice race where KVM could fail to detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is first page table being shadowed.
2024-05-12Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.10: - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is used only when synthesizing nested EPT violation, i.e. it's not the vCPU's "real" exit_qualification, which is tracked elsewhere. - Add a sanity check to assert that EPT Violations are the only sources of nested PML Full VM-Exits.
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-05-10Merge branch 'kvm-sev-es-ghcbv2' into HEADPaolo Bonzini
While the main additions from GHCB protocol version 1 to version 2 revolve mostly around SEV-SNP support, there are a number of changes applicable to SEV-ES guests as well. Pluck a handful patches from the SNP hypervisor patchset for GHCB-related changes that are also applicable to SEV-ES. A KVM_SEV_INIT2 field lets userspace can control the maximum GHCB protocol version advertised to guests and manage compatibility across kernels/versions.
2024-05-10Merge branch 'kvm-coco-pagefault-prep' into HEADPaolo Bonzini
A combination of prep work for TDX and SNP, and a clean up of the page fault path to (hopefully) make it easier to follow the rules for private memory, noslot faults, writes to read-only slots, etc.
2024-05-10Merge branch 'kvm-vmx-ve' into HEADPaolo Bonzini
Allow a non-zero value for non-present SPTE and removed SPTE, so that TDX can set the "suppress VE" bit.
2024-05-07KVM: SEV: Allow per-guest configuration of GHCB protocol versionMichael Roth
The GHCB protocol version may be different from one guest to the next. Add a field to track it for each KVM instance and extend KVM_SEV_INIT2 to allow it to be configured by userspace. Now that all SEV-ES support for GHCB protocol version 2 is in place, go ahead and default to it when creating SEV-ES guests through the new KVM_SEV_INIT2 interface. Keep the older KVM_SEV_ES_INIT interface restricted to GHCB protocol version 1. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add GHCB handling for termination requestsMichael Roth
GHCB version 2 adds support for a GHCB-based termination request that a guest can issue when it reaches an error state and wishes to inform the hypervisor that it should be terminated. Implement support for that similarly to GHCB MSR-based termination requests that are already available to SEV-ES guests via earlier versions of the GHCB protocol. See 'Termination Request' in the 'Invoking VMGEXIT' section of the GHCB specification for more details. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add GHCB handling for Hypervisor Feature Support requestsBrijesh Singh
Version 2 of the GHCB specification introduced advertisement of features that are supported by the Hypervisor. Now that KVM supports version 2 of the GHCB specification, bump the maximum supported protocol version. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-3-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: SEV: Add support to handle AP reset MSR protocolTom Lendacky
Add support for AP Reset Hold being invoked using the GHCB MSR protocol, available in version 2 of the GHCB specification. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240501071048.2208265-2-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Explicitly zero kvm_caps during vendor module loadSean Christopherson
Zero out all of kvm_caps when loading a new vendor module to ensure that KVM can't inadvertently rely on global initialization of a field, and add a comment above the definition of kvm_caps to call out that all fields needs to be explicitly computed during vendor module load. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_mce_cap on vendor module loadSean Christopherson
Effectively reset supported_mce_cap on vendor module load to ensure that capabilities aren't unintentionally preserved across module reload, e.g. if kvm-intel.ko added a module param to control LMCE support, or if someone somehow managed to load a vendor module that doesn't support LMCE after loading and unloading kvm-intel.ko. Practically speaking, this bug is a non-issue as kvm-intel.ko doesn't have a module param for LMCE, and there is no system in the world that supports both kvm-intel.ko and kvm-amd.ko. Fixes: c45dcc71b794 ("KVM: VMX: enable guest access to LMCE related MSRs") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86: Fully re-initialize supported_vm_types on vendor module loadSean Christopherson
Recompute the entire set of supported VM types when a vendor module is loaded, as preserving supported_vm_types across vendor module unload and reload can result in VM types being incorrectly treated as supported. E.g. if a vendor module is loaded with TDP enabled, unloaded, and then reloaded with TDP disabled, KVM_X86_SW_PROTECTED_VM will be incorrectly retained. Ditto for SEV_VM and SEV_ES_VM and their respective module params in kvm-amd.ko. Fixes: 2a955c4db1dd ("KVM: x86: Add supported_vm_types to kvm_caps") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240423165328.2853870-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfnsSean Christopherson
WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully handle the unexpected behavior instead of continuing on with dangerous state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot, and so could install a bogus PFN into the guest. The existing code is functionally ok, because kvm_faultin_pfn() pre-checks all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set. And checking only fault->slot would fall apart if KVM ever added a flag or condition that forced emulation, similar to how KVM handles writes to read-only memslots. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error valuesSean Christopherson
Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault() to harden KVM against using "uninitialized" values. In quotes because the fields are actually zero-initialized, and zero is a legal value for both page frame numbers and virtual addresses. E.g. failure to set "pfn" prior to creating an SPTE could result in KVM pointing at physical address '0', which is far less desirable than KVM generating a SPTE with reserved PA bits set and thus effectively killing the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faultsSean Christopherson
Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot" fault to ensure that KVM doesn't use a bogus virtual address, e.g. if there *was* a slot but it's unusable (APIC access page), or if there really was no slot, in which case fault->hva will be '0' (which is a legal address for x86). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>