summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/lapic.c
AgeCommit message (Collapse)Author
2024-09-09KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate APISean Christopherson
Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR, out of kvm_get_apic_interrupt() and into a separate API so that nested VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from L2 to L1. To correctly emulate nested posted interrupts while APICv is active, KVM must: 1. find the highest pending interrupt. 2. check if that IRQ is L2's notification vector 3. emulate VM-Exit if the IRQ is NOT the notification vector 4. ACK the IRQ in L1 _after_ VM-Exit When APICv is active, the process of moving the IRQ from the IRR to the ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI, and so acknowledging the interrupt before switching to vmcs01 would result in marking the IRQ as in-service in the wrong VMCS. KVM currently fudges around this issue by doing kvm_get_apic_interrupt() smack dab in the middle of emulating VM-Exit, but that hack doesn't play nice with nested posted interrupts, as notification vector IRQs don't trigger a VM-Exit in the first place. Cc: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)Sean Christopherson
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode()Sean Christopherson
Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bitsSean Christopherson
Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that are must-be-zero on both Intel and AMD, i.e. any reserved bits other than the BUSY bit, which Intel ignores and basically says is undefined. KVM's xapic_state_test selftest has been fudging the bug since commit 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode"), which essentially removed the testcase instead of fixing the bug. WARN if the nodecode path triggers a #GP, as the CPU is supposed to check reserved bits for ICR when it's partially virtualized. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22KVM: x86: Optimize local variable in start_sw_tscdeadline()Thorsten Blum
Change the data type of the local variable this_tsc_khz to u32 because virtual_tsc_khz is also declared as u32. Since do_div() casts the divisor to u32 anyway, changing the data type of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240814203345.2234-2-thorsten.blum@toblux.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-13KVM: x86: Make x2APIC ID 100% readonlySean Christopherson
Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race: static inline u32 kvm_apic_id(struct kvm_lapic *apic) { - return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff; + /* To avoid a race between apic_base and following APIC_ID update when + * switching to x2apic_mode, the x2apic mode returns initial x2apic id. + */ + if (apic_x2apic_mode(apic)) + return apic->vcpu->vcpu_id; + + return kvm_lapic_get_reg(apic, APIC_ID) >> 24; } Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM *does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID. Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID. WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm] CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014 RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm] Call Trace: <TASK> kvm_apic_set_state+0x1cf/0x5b0 [kvm] kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm] kvm_vcpu_ioctl+0x663/0x8a0 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fade8b9dd6f Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely. Reported-by: Michal Luczaj <mhal@rbox.co> Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co Reported-by: Haoyu Wu <haoyuwu254@gmail.com> Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240802202941.344889-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: x86: Eliminate log spam from limited APIC timer periodsJim Mattson
SAP's vSMP MemoryONE continuously requests a local APIC timer period less than 500 us, resulting in the following kernel log spam: kvm: vcpu 15: requested 70240 ns lapic timer period limited to 500000 ns kvm: vcpu 19: requested 52848 ns lapic timer period limited to 500000 ns kvm: vcpu 15: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70208 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 387520 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70160 ns lapic timer period limited to 500000 ns kvm: vcpu 66: requested 205744 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70224 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns limit_periodic_timer_frequency: 7569 callbacks suppressed ... To eliminate this spam, change the pr_info_ratelimited() in limit_periodic_timer_frequency() to pr_info_once(). Reported-by: James Houghton <jthoughton@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Message-ID: <20240724190640.2449291-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_opsWei Wang
Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16KVM: x86: Replace static_call_cond() with static_call()Wei Wang
The use of static_call_cond() is essentially the same as static_call() on x86 (e.g. static_call() now handles a NULL pointer as a NOP), so replace it with static_call() to simplify the code. Link: https://lore.kernel.org/all/3916caa1dcd114301a49beafa5030eca396745c1.1679456900.git.jpoimboe@kernel.org/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-2-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups
2024-06-05KVM: x86: Make nanoseconds per APIC bus cycle a VM variableIsaku Yamahata
Introduce the VM variable "nanoseconds per APIC bus cycle" in preparation to make the APIC bus frequency configurable. The TDX architecture hard-codes the core crystal clock frequency to 25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture does not allow the VMM to override the value. In addition, per Intel SDM: "The APIC timer frequency will be the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register." The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded APIC bus frequency of 1GHz. Introduce the VM variable "nanoseconds per APIC bus cycle" to prepare for allowing userspace to tell KVM to use the frequency that TDX mandates instead of the default 1Ghz. Doing so ensures that the guest doesn't have a conflicting view of the APIC bus frequency. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> [reinette: rework changelog] Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/ae75ce37c6c38bb4efd10a0a41932984c40b24ac.1714081726.git.reinette.chatre@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86: Drop support for hand tuning APIC timer advancement from userspaceSean Christopherson
Remove support for specifying a static local APIC timer advancement value, and instead present a read-only boolean parameter to let userspace enable or disable KVM's dynamic APIC timer advancement. Realistically, it's all but impossible for userspace to specify an advancement that is more precise than what KVM's adaptive tuning can provide. E.g. a static value needs to be tuned for the exact hardware and kernel, and if KVM is using hrtimers, likely requires additional tuning for the exact configuration of the entire system. Dropping support for a userspace provided value also fixes several flaws in the interface. E.g. KVM interprets a negative value other than -1 as a large advancement, toggling between a negative and positive value yields unpredictable behavior as vCPUs will switch from dynamic to static advancement, changing the advancement in the middle of VM creation can result in different values for vCPUs within a VM, etc. Those flaws are mostly fixable, but there's almost no justification for taking on yet more complexity (it's minimal complexity, but still non-zero). The only arguments against using KVM's adaptive tuning is if a setup needs a higher maximum, or if the adjustments are too reactive, but those are arguments for letting userspace control the absolute max advancement and the granularity of each adjustment, e.g. similar to how KVM provides knobs for halt polling. Link: https://lore.kernel.org/all/20240520115334.852510-1-zhoushuling@huawei.com Cc: Shuling Zhou <zhoushuling@huawei.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522010304.1650603-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platformsSandipan Das
On AMD and Hygon platforms, the local APIC does not automatically set the mask bit of the LVTPC register when handling a PMI and there is no need to clear it in the kernel's PMI handler. For guests, the mask bit is currently set by kvm_apic_local_deliver() and unless it is cleared by the guest kernel's PMI handler, PMIs stop arriving and break use-cases like sampling with perf record. This does not affect non-PerfMonV2 guests because PMIs are handled in the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC mask bit irrespective of the vendor. Before: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ] After: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ] Fixes: a16eb25b09c0 ("KVM: x86: Mask LVTPC when handling a PMI") Cc: stable@vger.kernel.org Signed-off-by: Sandipan Das <sandipan.das@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> [sean: use is_intel_compatible instead of !is_amd_or_hygon()] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240405235603.1173076-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...
2024-03-11Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.
2024-03-04KVM: x86/xen: inject vCPU upcall vector when local APIC is enabledDavid Woodhouse
Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is *disabled*. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC codeSean Christopherson
Move incrementing and decrementing of kvm_has_noapic_vcpu into kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug where KVM fails to decrement the count if vCPU creation ultimately fails, e.g. due to a memory allocation failing. Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize lapic_in_kernel() checks, and that optimization is quite dubious. That, and practically speaking no setup that cares at all about performance runs with a userspace local APIC. Reported-by: Li RongQing <lirongqing@baidu.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-29KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safeBrijesh Singh
Implement a workaround for an SNP erratum where the CPU will incorrectly signal an RMP violation #PF if a hugepage (2MB or 1GB) collides with the RMP entry of a VMCB, VMSA or AVIC backing page. When SEV-SNP is globally enabled, the CPU marks the VMCB, VMSA, and AVIC backing pages as "in-use" via a reserved bit in the corresponding RMP entry after a successful VMRUN. This is done for _all_ VMs, not just SNP-Active VMs. If the hypervisor accesses an in-use page through a writable translation, the CPU will throw an RMP violation #PF. On early SNP hardware, if an in-use page is 2MB-aligned and software accesses any part of the associated 2MB region with a hugepage, the CPU will incorrectly treat the entire 2MB region as in-use and signal a an RMP violation #PF. To avoid this, the recommendation is to not use a 2MB-aligned page for the VMCB, VMSA or AVIC pages. Add a generic allocator that will ensure that the page returned is not 2MB-aligned and is safe to be used when SEV-SNP is enabled. Also implement similar handling for the VMCB/VMSA pages of nested guests. [ mdr: Squash in nested guest handling from Ashish, commit msg fixups. ] Reported-by: Alper Gun <alpergun@google.com> # for nested VMSA case Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Marc Orr <marcorr@google.com> Signed-off-by: Marc Orr <marcorr@google.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240126041126.1927228-22-michael.roth@amd.com
2023-12-07KVM: x86: Introduce helper to check if vector is set in Hyper-V SynICVitaly Kuznetsov
As a preparation to making Hyper-V emulation optional, create a dedicated kvm_hv_synic_has_vector() helper to avoid extra ifdefs in lapic.c. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-6-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07KVM: x86: Introduce helper to check if auto-EOI is set in Hyper-V SynICVitaly Kuznetsov
As a preparation to making Hyper-V emulation optional, create a dedicated kvm_hv_synic_auto_eoi_set() helper to avoid extra ifdefs in lapic.c No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231205103630.1391318-5-vkuznets@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-31Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 APIC changes for 6.7: - Purge VMX's posted interrupt descriptor *before* loading APIC state when handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC state, i.e. can immediately pend an IRQ if the expiry is in the past. - Clear the ICR.BUSY bit when handling trap-like x2APIC writes. This avoids a WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
2023-09-28KVM: x86: Clear bit12 of ICR after APIC-write VM-exitTao Su
When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. Under the x2APIC section, regarding ICR, the SDM says: It remains readable only to aid in debugging; however, software should not assume the value returned by reading the ICR is the last written value. I.e. the guest is allowed to set bit 12. However, the SDM also gives KVM free reign to do whatever it wants with the bit, so long as KVM's behavior doesn't confuse userspace or break KVM's ABI. Clear bit 12 so that it reads back as '0'. This approach is safer than "do nothing" and is consistent with the case where IPI virtualization is disabled or not supported, i.e., handle_fastpath_set_x2apic_icr_irqoff() -> kvm_x2apic_icr_write() Opportunistically replace the TODO with a comment calling out that eating the write is likely faster than a conditional branch around the busy bit. Link: https://lore.kernel.org/all/ZPj6iF0Q7iynn62p@google.com/ Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") Cc: stable@vger.kernel.org Signed-off-by: Tao Su <tao1.su@linux.intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20230914055504.151365-1-tao1.su@linux.intel.com [sean: tweak changelog, replace TODO with comment, drop local "val"] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.Haitao Shan
When running android emulator (which is based on QEMU 2.12) on certain Intel hosts with kernel version 6.3-rc1 or above, guest will freeze after loading a snapshot. This is almost 100% reproducible. By default, the android emulator will use snapshot to speed up the next launching of the same android guest. So this breaks the android emulator badly. I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by running command "loadvm" after "savevm". The same issue is observed. At the same time, none of our AMD platforms is impacted. More experiments show that loading the KVM module with "enable_apicv=false" can workaround it. The issue started to show up after commit 8e6ed96cdd50 ("KVM: x86: fire timer when it is migrated and expired, and in oneshot mode"). However, as is pointed out by Sean Christopherson, it is introduced by commit 967235d32032 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC"). commit 8e6ed96cdd50 ("KVM: x86: fire timer when it is migrated and expired, and in oneshot mode") just makes it easier to hit the issue. Having both commits, the oneshot lapic timer gets fired immediately inside the KVM_SET_LAPIC call when loading the snapshot. On Intel platforms with APIC virtualization and posted interrupt processing, this eventually leads to setting the corresponding PIR bit. However, the whole PIR bits get cleared later in the same KVM_SET_LAPIC call by apicv_post_state_restore. This leads to timer interrupt lost. The fix is to move vmx_apicv_post_state_restore to the beginning of the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore. What vmx_apicv_post_state_restore does is actually clearing any former apicv state and this behavior is more suitable to carry out in the beginning. Fixes: 967235d32032 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Haitao Shan <hshan@google.com> Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25KVM: x86: Mask LVTPC when handling a PMIJim Mattson
Per the SDM, "When the local APIC handles a performance-monitoring counters interrupt, it automatically sets the mask flag in the LVT performance counter register." Add this behavior to KVM's local APIC emulation. Failure to mask the LVTPC entry results in spurious PMIs, e.g. when running Linux as a guest, PMI handlers that do a "late_ack" spew a large number of "dazed and confused" spurious NMI warnings. Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Cc: stable@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> Tested-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230925173448.3518223-3-mizhang@google.com [sean: massage changelog, correct Fixes] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-31Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID
2023-08-02KVM: x86: Retry APIC optimized map recalc if vCPU is added/enabledSean Christopherson
Retry the optimized APIC map recalculation if an APIC-enabled vCPU shows up between allocating the map and filling in the map data. Conditionally reschedule before retrying even though the number of vCPUs that can be created is bounded by KVM. Retrying a few thousand times isn't so slow as to be hugely problematic, but it's not blazing fast either. Reset xapic_id_mistach on each retry as a vCPU could change its xAPIC ID between loops, but do NOT reset max_id. The map size also factors in whether or not a vCPU's local APIC is hardware-enabled, i.e. userspace and/or the guest can theoretically keep KVM retrying indefinitely. The only downside is that KVM will allocate more memory than is strictly necessary if the vCPU with the highest x2APIC ID disabled its APIC while the recalculation was in-progress. Refresh kvm->arch.apic_map_dirty to opportunistically change it from DIRTY => UPDATE_IN_PROGRESS to avoid an unnecessary recalc from a different task, i.e. if another task is waiting to attempt an update (which is likely since a retry happens if and only if an update is required). Link: https://lore.kernel.org/r/20230602233250.1014316-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-29KVM: x86: VMX: set irr_pending in kvm_apic_update_irrMaxim Levitsky
When the APICv is inhibited, the irr_pending optimization is used. Therefore, when kvm_apic_update_irr sets bits in the IRR, it must set irr_pending to true as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-29KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomicallyMaxim Levitsky
If APICv is inhibited, then IPIs from peer vCPUs are done by atomically setting bits in IRR. This means, that when __kvm_apic_update_irr copies PIR to IRR, it has to modify IRR atomically as well. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230726135945.260841-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-01Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 changes for 6.5: * Move handling of PAT out of MTRR code and dedup SVM+VMX code * Fix output of PIC poll command emulation when there's an interrupt * Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. * Misc cleanups
2023-06-16KVM: x86: Remove PRIx* definitions as they are solely for user spaceAndy Shevchenko
In the Linux kernel we do not support PRI.64 specifiers. Moreover they seem not to be used anyway here. Drop them. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20230616150233.83813-1-andriy.shevchenko@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-02KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-boundsSean Christopherson
Bail from kvm_recalculate_phys_map() and disable the optimized map if the target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added and/or enabled its local APIC after the map was allocated. This fixes an out-of-bounds access bug in the !x2apic_format path where KVM would write beyond the end of phys_map. Check the x2APIC ID regardless of whether or not x2APIC is enabled, as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC being enabled, i.e. the check won't get false postivies. Note, this also affects the x2apic_format path, which previously just ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a bug fix, as ignoring the vCPU meant that KVM would not send interrupts to the vCPU until the next map recalculation. In practice, that "bug" is likely benign as a newly present vCPU/APIC would immediately trigger a recalc. But, there's no functional downside to disabling the map, and a future patch will gracefully handle the -E2BIG case by retrying instead of simply disabling the optimized map. Opportunistically add a sanity check on the xAPIC ID size, along with a comment explaining why the xAPIC ID is guaranteed to be "good". Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: 5b84b0291702 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPICEmanuele Giuseppe Esposito
Reinitialize the xAPIC ID to the vCPU ID when userspace forces the APIC to transition directly from x2APIC to xAPIC mode, e.g. to emulate RESET. KVM already stuffs the xAPIC ID when the APIC is transitioned from DISABLED to xAPIC (commit 49bd29ba1dbd ("KVM: x86: reset APIC ID when enabling LAPIC")), i.e. userspace is conditioned to expect KVM to update the xAPIC ID, but KVM doesn't handle the architecturally-impossible case where userspace forces x2APIC=>xAPIC via KVM_SET_MSRS. On its own, the "bug" is benign, as userspace emulation of RESET will also stuff APIC registers via KVM_SET_LAPIC, i.e. will manually set the xAPIC ID. However, commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") introduced a bug, fixed by commit commit ef40757743b4 ("KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself"), that caused KVM to fail to properly update the xAPIC ID when handling KVM_SET_LAPIC. Refresh the xAPIC ID even though it's not strictly necessary so that KVM provides consistent behavior. Note, KVM follows Intel architecture with regard to handling the xAPIC ID and x2APIC IDs across mode transitions. For the APIC DISABLED case (commit 49bd29ba1dbd), Intel's SDM says the xAPIC ID _may_ be reinitialized 10.4.3 Enabling or Disabling the Local APIC When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC may be lost and the APIC may return to the state described in Section 10.4.7.1, “Local APIC State After Power-Up or Reset.” 10.4.7.1 Local APIC State After Power-Up or Reset ... The local APIC ID register is set to a unique APIC ID. ... i.e. KVM's behavior is legal as per Intel's architecture. In practice, Intel's behavior is N/A as modern Intel CPUs (since at least Haswell) make the xAPIC ID fully read-only. And for xAPIC => x2APIC transitions (commit 257b9a5faab5 ("KVM: x86: use correct APIC ID on x2APIC transition")), Intel's SDM says: Any APIC ID value written to the memory-mapped local APIC ID register is not preserved. AMD's APM says nothing (that I could find) about the xAPIC ID when the APIC is DISABLED, but testing on bare metal (Rome) shows that the xAPIC ID is preserved when the APIC is DISABLED and re-enabled in xAPIC mode. AMD also preserves the xAPIC ID when the APIC is transitioned from xAPIC to x2APIC, i.e. allows a backdoor write of the x2APIC ID, which is again not emulated by KVM. Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Link: https://lore.kernel.org/all/20230109130605.2013555-2-eesposit@redhat.com [sean: rewrite changelog, set xAPIC ID iff APIC is enabled] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: fire timer when it is migrated and expired, and in oneshot modeLi RongQing
when the vCPU was migrated, if its timer is expired, KVM _should_ fire the timer ASAP, zeroing the deadline here will cause the timer to immediately fire on the destination Cc: Sean Christopherson <seanjc@google.com> Cc: Peter Shier <pshier@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20230106040625.8404-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Split out logic to generate "readable" APIC regs mask to helperSean Christopherson
Move the generation of the readable APIC regs bitmask to a standalone helper so that VMX can use the mask for its MSR interception bitmaps. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Mark x2APIC DFR reg as non-existent for x2APICSean Christopherson
Mark APIC_DFR as being invalid/non-existent in x2APIC mode instead of handling it as a one-off check in kvm_x2apic_msr_read(). This will allow reusing "valid_reg_mask" to generate VMX's interception bitmaps for x2APIC. Handling DFR in the common read path may also fix the Hyper-V PV MSR interface, if that can coexist with x2APIC. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Inject #GP on x2APIC WRMSR that sets reserved bits 63:32Sean Christopherson
Reject attempts to set bits 63:32 for 32-bit x2APIC registers, i.e. all x2APIC registers except ICR. Per Intel's SDM: Non-zero writes (by WRMSR instruction) to reserved bits to these registers will raise a general protection fault exception Opportunistically fix a typo in a nearby comment. Reported-by: Marc Orr <marcorr@google.com> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: Inject #GP if WRMSR sets reserved bits in APIC Self-IPISean Christopherson
Inject a #GP if the guest attempts to set reserved bits in the x2APIC-only Self-IPI register. Bits 7:0 hold the vector, all other bits are reserved. Reported-by: Marc Orr <marcorr@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24KVM: x86: remove redundant ret variablezhang songyi
Return value from apic_get_tmcct() directly instead of taking this in another redundant variable. Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Link: https://lore.kernel.org/r/202211231704457807160@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24Merge branch 'kvm-lapic-fix-and-cleanup' into HEADPaolo Bonzini
The first half or so patches fix semi-urgent, real-world relevant APICv and AVIC bugs. The second half fixes a variety of AVIC and optimized APIC map bugs where KVM doesn't play nice with various edge cases that are architecturally legal(ish), but are unlikely to occur in most real world scenarios Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Add helpers to recalc physical vs. logical optimized APIC mapsSean Christopherson
Move the guts of kvm_recalculate_apic_map()'s main loop to two separate helpers to handle recalculating the physical and logical pieces of the optimized map. Having 100+ lines of code in the for-loop makes it hard to understand what is being calculated where. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Allow APICv APIC ID inhibit to be clearedGreg Edwards
Legacy kernels prior to commit 4399c03c6780 ("x86/apic: Remove verify_local_APIC()") write the APIC ID of the boot CPU twice to verify a functioning local APIC. This results in APIC acceleration inhibited on these kernels for reason APICV_INHIBIT_REASON_APIC_ID_MODIFIED. Allow the APICV_INHIBIT_REASON_APIC_ID_MODIFIED inhibit reason to be cleared if/when all APICs in xAPIC mode set their APIC ID back to the expected vcpu_id value. Fold the functionality previously in kvm_lapic_xapic_id_updated() into kvm_recalculate_apic_map(), as this allows examining all APICs in one pass. Fixes: 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") Signed-off-by: Greg Edwards <gedwards@ddn.com> Link: https://lore.kernel.org/r/20221117183247.94314-1-gedwards@ddn.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: SVM: Inhibit AVIC if vCPUs are aliased in logical modeSean Christopherson
Inhibit SVM's AVIC if multiple vCPUs are aliased to the same logical ID. Architecturally, all CPUs whose logical ID matches the MDA are supposed to receive the interrupt; overwriting existing entries in AVIC's logical=>physical map can result in missed IPIs. Fixes: 18f40c53e10f ("svm: Add VMEXIT handlers for AVIC") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Inhibit APICv/AVIC if the optimized physical map is disabledSean Christopherson
Inhibit APICv/AVIC if the optimized physical map is disabled so that KVM KVM provides consistent APIC behavior if xAPIC IDs are aliased due to vcpu_id being truncated and the x2APIC hotplug hack isn't enabled. If the hotplug hack is disabled, events that are emulated by KVM will follow architectural behavior (all matching vCPUs receive events, even if the "match" is due to truncation), whereas APICv and AVIC will deliver events only to the first matching vCPU, i.e. the vCPU that matches without truncation. Note, the "extra" inhibit is needed because KVM deliberately ignores mismatches due to truncation when applying the APIC_ID_MODIFIED inhibit so that large VMs (>255 vCPUs) can run with APICv/AVIC. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDsSean Christopherson
Apply KVM's hotplug hack if and only if userspace has enabled 32-bit IDs for x2APIC. If 32-bit IDs are not enabled, disable the optimized map to honor x86 architectural behavior if multiple vCPUs shared a physical APIC ID. As called out in the changelog that added the hack, all CPUs whose (possibly truncated) APIC ID matches the target are supposed to receive the IPI. KVM intentionally differs from real hardware, because real hardware (Knights Landing) does just "x2apic_id & 0xff" to decide whether to accept the interrupt in xAPIC mode and it can deliver one interrupt to more than one physical destination, e.g. 0x123 to 0x123 and 0x23. Applying the hack even when x2APIC is not fully enabled means KVM doesn't correctly handle scenarios where the guest has aliased xAPIC IDs across multiple vCPUs, as only the vCPU with the lowest vCPU ID will receive any interrupts. It's extremely unlikely any real world guest aliases APIC IDs, or even modifies APIC IDs, but KVM's behavior is arbitrary, e.g. the lowest vCPU ID "wins" regardless of which vCPU is "aliasing" and which vCPU is "normal". Furthermore, the hack is _not_ guaranteed to work! The hack works if and only if the optimized APIC map is successfully allocated. If the map allocation fails (unlikely), KVM will fall back to its unoptimized behavior, which _does_ honor the architectural behavior. Pivot on 32-bit x2APIC IDs being enabled as that is required to take advantage of the hotplug hack (see kvm_apic_state_fixup()), i.e. won't break existing setups unless they are way, way off in the weeds. And an entry in KVM's errata to document the hack. Alternatively, KVM could provide an actual x2APIC quirk and document the hack that way, but there's unlikely to ever be a use case for disabling the quirk. Go the errata route to avoid having to validate a quirk no one cares about. Fixes: 5bd5db385b3e ("KVM: x86: allow hotplug of VCPU with APIC ID over 0xff") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Disable APIC logical map if vCPUs are aliased in logical modeSean Christopherson
Disable the optimized APIC logical map if multiple vCPUs are aliased to the same logical ID. Architecturally, all CPUs whose logical ID matches the MDA are supposed to receive the interrupt; overwriting existing map entries can result in missed IPIs. Fixes: 1e08ec4a130e ("KVM: optimize apic interrupt delivery") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Disable APIC logical map if logical ID covers multiple MDAsSean Christopherson
Disable the optimized APIC logical map if a logical ID covers multiple MDAs, i.e. if a vCPU has multiple bits set in its ID. In logical mode, events match if "ID & MDA != 0", i.e. creating an entry for only the first bit can cause interrupts to be missed. Note, creating an entry for every bit is also wrong as KVM would generate IPIs for every matching bit. It would be possible to teach KVM to play nice with this edge case, but it is very much an edge case and probably not used in any real world OS, i.e. it's not worth optimizing. Fixes: 1e08ec4a130e ("KVM: optimize apic interrupt delivery") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Skip redundant x2APIC logical mode optimized cluster setupSean Christopherson
Skip the optimized cluster[] setup for x2APIC logical mode, as KVM reuses the optimized map's phys_map[] and doesn't actually need to insert the target apic into the cluster[]. The LDR is derived from the x2APIC ID, and both are read-only in KVM, thus the vCPU's cluster[ldr] is guaranteed to be the same entry as the vCPU's phys_map[x2apic_id] entry. Skipping the unnecessary setup will allow a future fix for aliased xAPIC logical IDs to simply require that cluster[ldr] is non-NULL, i.e. won't have to special case x2APIC. Alternatively, the future check could allow "cluster[ldr] == apic", but that ends up being terribly confusing because cluster[ldr] is only set at the very end, i.e. it's only possible due to x2APIC's shenanigans. Another alternative would be to send x2APIC down a separate path _after_ the calculation and then assert that all of the above, but the resulting code is rather messy, and it's arguably unnecessary since asserting that the actual LDR matches the expected LDR means that simply testing that interrupts are delivered correctly provides the same guarantees. Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Explicitly track all possibilities for APIC map's logical modesSean Christopherson
Track all possibilities for the optimized APIC map's logical modes instead of overloading the pseudo-bitmap and treating any "unknown" value as "invalid". As documented by the now-stale comment above the mode values, the values did have meaning when the optimized map was originally added. That dependent logical was removed by commit e45115b62f9a ("KVM: x86: use physical LAPIC array for logical x2APIC"), but the obfuscated behavior and its comment were left behind. Opportunistically rename "mode" to "logical_mode", partly to make it clear that the "disabled" case applies only to the logical map, but also to prove that there is no lurking code that expects "mode" to be a bitmap. Functionally, this is a glorified nop. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Explicitly skip optimized logical map setup if vCPU's LDR==0Sean Christopherson
Explicitly skip the optimized map setup if the vCPU's LDR is '0', i.e. if the vCPU will never respond to logical mode interrupts. KVM already skips setup in this case, but relies on kvm_apic_map_get_logical_dest() to generate mask==0. KVM still needs the mask=0 check as a non-zero LDR can yield mask==0 depending on the mode, but explicitly handling the LDR will make it simpler to clean up the logical mode tracking in the future. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Inhibit APIC memslot if x2APIC and AVIC are enabledSean Christopherson
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with x2APIC enabled. On AMD, if its "hybrid" mode is enabled (AVIC is enabled when x2APIC is enabled even without x2AVIC support), keeping the APIC access page memslot results in the guest being able to access the virtual APIC page as x2APIC is fully emulated by KVM. I.e. hardware isn't aware that the guest is operating in x2APIC mode. Exempt nested SVM's update of APICv state from the new logic as x2APIC can't be toggled on VM-Exit. In practice, invoking the x2APIC logic should be harmless precisely because it should be a glorified nop, but play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU lock. Intel doesn't suffer from the same issue as APICv has fully independent VMCS controls for xAPIC vs. x2APIC virtualization. Technically, KVM should provide bus error semantics and not memory semantics for the APIC page when x2APIC is enabled, but KVM already provides memory semantics in other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware disabled (via APIC_BASE MSR). Note, checking apic_access_memslot_enabled without taking locks relies it being set during vCPU creation (before kvm_vcpu_reset()). vCPUs can race to set the inhibit and delete the memslot, i.e. can get false positives, but can't get false negatives as apic_access_memslot_enabled can't be toggled "on" once any vCPU reaches KVM_RUN. Opportunistically drop the "can" while updating avic_activate_vmcb()'s comment, i.e. to state that KVM _does_ support the hybrid mode. Move the "Note:" down a line to conform to preferred kernel/KVM multi-line comment style. Opportunistically update the apicv_update_lock comment, as it isn't actually used to protect apic_access_memslot_enabled (which is protected by slots_lock). Fixes: 0e311d33bfbe ("KVM: SVM: Introduce hybrid-AVIC mode") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>