path: root/arch/x86/kvm
Age | Commit message | Author
2023-01-24 | KVM: VMX: Access @flags as a 32-bit value in __vmx_vcpu_run() | Sean Christopherson
Access @flags using 32-bit operands when saving and testing @flags for VMX_RUN_VMRESUME, as using 8-bit operands is unnecessarily fragile due to relying on VMX_RUN_VMRESUME being in bits 0-7. The behavior of treating @flags a single byte is a holdover from when the param was "bool launched", i.e. is not deliberate. Cc: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20221119003747.2615229-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
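The fragility described above can be sketched in plain C. This is only an illustration, not the assembly in __vmx_vcpu_run(): the flag names and bit positions below are hypothetical, chosen so that one flag sits outside bits 0-7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout, for illustration only. */
#define RUN_FLAG_LOW   (1u << 0)   /* lives in bits 0-7 */
#define RUN_FLAG_HIGH  (1u << 8)   /* does not survive an 8-bit access */

/* Fragile pattern: truncate flags to one byte before testing. */
static bool test_flag_8bit(uint32_t flags, uint32_t flag)
{
    uint8_t low = (uint8_t)flags;

    return (low & flag) != 0;
}

/* Robust pattern: test against the full 32-bit value. */
static bool test_flag_32bit(uint32_t flags, uint32_t flag)
{
    return (flags & flag) != 0;
}
```

With this layout, the 8-bit test silently reports RUN_FLAG_HIGH as clear, while the 32-bit test works for any bit position.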
2023-01-24 | KVM: SVM: Account scratch allocations used to decrypt SEV guest memory | Anish Ghulati
Account the temp/scratch allocation used to decrypt unaligned debug accesses to SEV guest memory, as the allocation is very much tied to the target VM. Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Anish Ghulati <aghulati@google.com> Link: https://lore.kernel.org/r/20230113220923.2834699-1-aghulati@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: svm/avic: Drop "struct kvm_x86_ops" for avic_hardware_setup() | Like Xu
Even in commit 4bdec12aa8d6 ("KVM: SVM: Detect X2APIC virtualization (x2AVIC) support"), where avic_hardware_setup() was first introduced, its only pass-in parameter "struct kvm_x86_ops *ops" is not used at all. Clean it up a bit to avoid compiler ranting from LLVM toolchain. Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221109115952.92816-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: SVM: remove redundant ret variable | zhang songyi
Return value from svm_nmi_blocked() directly instead of taking this in another redundant variable. Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Link: https://lore.kernel.org/r/202211282003389362484@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/pmu: Introduce masked events to the pmu event filter | Aaron Lewis
When building a list of filter events, it can sometimes be a challenge to fit all the events needed to adequately restrict the guest into the limited space available in the pmu event filter. This stems from the fact that the pmu event filter requires each event (i.e. event select + unit mask) be listed, when the intention might be to restrict the event select altogether, regardless of its unit mask. Instead of increasing the number of filter events in the pmu event filter, add a new encoding that is able to do a more generalized match on the unit mask. Introduce masked events as another encoding the pmu event filter understands. Masked events have the fields: mask, match, and exclude. When filtering based on these events, the mask is applied to the guest's unit mask to see if it matches the match value (i.e. umask & mask == match). The exclude bit can then be used to exclude events from that match. E.g. for a given event select, if it's easier to say which unit mask values shouldn't be filtered, a masked event can be set up to match all possible unit mask values, then another masked event can be set up to match the unit mask values that shouldn't be filtered. Userspace can query to see if this feature exists by looking for the capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS. This feature is enabled by setting the flags field in the pmu event filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS. Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY(). It is an error to have a bit set outside the valid bits for a masked event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in such cases, including the high bits of the event select (35:32) if called on Intel. With these updates the filter matching code has been updated to match on a common event. Masked events were flexible enough to handle both event types, so they were used as the common event. 
This changes how guest events get filtered because regardless of the type of event used in the uAPI, they will be converted to masked events. Because of this there could be a slight performance hit because instead of matching the filter event with a lookup on event select + unit mask, it does a lookup on event select then walks the unit masks to find the match. This shouldn't be a big problem because I would expect the set of common event selects to be small, and if they aren't the set can likely be reduced by using masked events to generalize the unit mask. Using one type of event when filtering guest events allows for a common code path to be used. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
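The mask/match/exclude rule described in this commit message can be sketched as follows. This is a minimal illustration that assumes all entries belong to one event select; the struct layout and function names are hypothetical and do not reproduce KVM's encoding or KVM_PMU_ENCODE_MASKED_ENTRY(). Note that in C the comparison needs explicit parentheses, since `==` binds tighter than `&`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, unpacked form of one masked entry for a single event select. */
struct masked_entry {
    uint8_t mask;    /* bits of the guest unit mask to compare */
    uint8_t match;   /* value the masked unit mask must equal */
    bool exclude;    /* true: remove matching unit masks from the set */
};

static bool entry_matches(const struct masked_entry *e, uint8_t umask)
{
    return (umask & e->mask) == e->match;
}

/*
 * A unit mask is in the filter's set if at least one include entry
 * matches it and no exclude entry matches it.
 */
static bool umask_in_filter(const struct masked_entry *entries, size_t n,
                            uint8_t umask)
{
    bool included = false;

    for (size_t i = 0; i < n; i++) {
        if (!entry_matches(&entries[i], umask))
            continue;
        if (entries[i].exclude)
            return false;
        included = true;
    }
    return included;
}
```

For instance, an include entry with mask 0x00 and match 0x00 covers every unit mask, and a second entry with mask 0xff, match 0x01 and the exclude bit set then carves unit mask 0x01 back out of the set.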
2023-01-24 | KVM: x86/pmu: prepare the pmu event filter for masked events | Aaron Lewis
Refactor check_pmu_event_filter() in preparation for masked events. No functional changes intended. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-4-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/pmu: Remove impossible events from the pmu event filter | Aaron Lewis
If it's not possible for an event in the pmu event filter to match a pmu event being programmed by the guest, it's pointless to have it in the list. Opt for a shorter list by removing those events. Because this is established uAPI the pmu event filter can't outright reject these events as garbage and return an error. Instead, play nice and remove them from the list. Also, opportunistically rewrite the comment when the filter is set to clarify that it guards against *all* TOCTOU attacks on the verified data. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-3-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/pmu: Correct the mask used in a pmu event filter lookup | Aaron Lewis
When checking if a pmu event the guest is attempting to program should be filtered, only consider the event select + unit mask in that decision. Use an architecture specific mask to mask out all other bits, including bits 35:32 on Intel. Those bits are not part of the event select and should not be considered in that decision. Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Link: https://lore.kernel.org/r/20221220161236.555143-2-aaronlewis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
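A rough sketch of the masking this fix describes: reduce the guest's event selector to event select + unit mask before the lookup. The macro and function names are hypothetical. The bit positions (event select in bits 7:0, unit mask in bits 15:8, extra event select bits in 35:32 on AMD) follow the usual x86 PERFEVTSEL layout, and the boolean parameter stands in for the architecture-specific mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the selector fields used by the lookup. */
#define EVENTSEL_EVENT_LOW   0x00000000000000FFULL  /* bits 7:0 */
#define EVENTSEL_UMASK       0x000000000000FF00ULL  /* bits 15:8 */
#define EVENTSEL_EVENT_HIGH  0x0000000F00000000ULL  /* bits 35:32 */

/* Keep only the bits that identify the event for filtering purposes. */
static uint64_t filter_key(uint64_t eventsel, bool has_high_event_bits)
{
    uint64_t mask = EVENTSEL_EVENT_LOW | EVENTSEL_UMASK;

    /* Bits 35:32 are part of the event select only on some vendors. */
    if (has_high_event_bits)
        mask |= EVENTSEL_EVENT_HIGH;

    return eventsel & mask;
}
```

Passing false for the second argument models the Intel case from the commit message, where bits 35:32 must not influence the filtering decision.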
2023-01-24 | KVM: x86/mmu: Use kstrtobool() instead of strtobool() | Christophe JAILLET
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.1673689135.git.christophe.jaillet@wanadoo.fr Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Cleanup range-based flushing for given page | Hou Wenlong
Use the new kvm_flush_remote_tlbs_gfn() helper to cleanup the call sites of range-based flushing for given page, which makes the code clear. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/593ee1a876ece0e819191c0b23f56b940d6686db.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Fix wrong gfn range of tlb flushing in validate_direct_spte() | Hou Wenlong
The spte pointing to the child SP is dropped, so the whole gfn range covered by the child SP should be flushed. Although Hyper-V may treat a 1-page flush the same if the address points to a huge page, it would still be better to use the correct huge page size. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/5f297c566f7d7ff2ea6da3c66d050f69ce1b8ede.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range | Hou Wenlong
When a spte is dropped, the start gfn of tlb flushing should be the gfn of spte not the base gfn of SP which contains the spte. Also introduce a helper function to do range-based flushing when a spte is dropped, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Reduce gfn range of tlb flushing in tdp_mmu_map_handle_target_level() | Hou Wenlong
Since the child SP is zapped, the gfn range of tlb flushing should be the range covered by the child SP, not the parent SP. Replace sp->gfn, which is the base gfn of the parent SP, with iter->gfn and use the correct size of gfn range for the child SP to reduce the tlb flushing range. Fixes: bb95dfb9e2df ("KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/528ab9c784a486e9ce05f61462ad9260796a8732.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Fix wrong gfn range of tlb flushing in kvm_set_pte_rmapp() | Hou Wenlong
When the spte of a huge page is dropped in kvm_set_pte_rmapp(), the whole gfn range covered by the spte should be flushed. However, rmap_walk_init_level() doesn't align down the gfn for the new level like the tdp iterator does, so the gfn used in kvm_set_pte_rmapp() is not the base gfn of the huge page. The size of the gfn range is also wrong for a huge page. Use the base gfn of the huge page and the size of the huge page when flushing tlbs for a huge page. Also introduce a helper function to flush the given page (huge or not) of guest memory, which would help prevent future buggy use of kvm_flush_remote_tlbs_with_address() in such case. Fixes: c3134ce240eed ("KVM: Replace old tlb flush function with new one to flush a specified range.") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/0ce24d7078fa5f1f8d64b0c59826c50f32f8065e.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h | Hou Wenlong
Rounding down the GFN to a huge page size is a common pattern throughout KVM, so move round_gfn_for_level() helper in tdp_iter.c to mmu_internal.h for common usage. Also rename it as gfn_round_for_level() to use gfn_* prefix and clean up the other call sites. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
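The rounding pattern mentioned here, and the flush range it yields for a huge page, can be sketched as below. This assumes the x86 convention of 9 address bits per paging level (level 1 = 4KiB page, level 2 = 2MiB, level 3 = 1GiB). The function names are illustrative and are not the kernel's helpers.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/* Number of 4KiB pages covered by one mapping at the given level. */
static uint64_t pages_per_level(int level)
{
    return 1ULL << ((level - 1) * 9);
}

/* Round a gfn down to the base gfn of the page that maps it at this level. */
static gfn_t round_gfn_down(gfn_t gfn, int level)
{
    return gfn & ~(pages_per_level(level) - 1);
}
```

A range-based flush for a huge page would then start at round_gfn_down(gfn, level) and cover pages_per_level(level) pages, rather than starting at an unaligned gfn or covering a single page.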
2023-01-24 | KVM: x86/mmu: fix an incorrect comment in kvm_mmu_new_pgd() | Wei Liu
There is no function named kvm_mmu_ensure_valid_pgd(). Fix the comment and remove the pair of braces to conform to Linux kernel coding style. Signed-off-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221128214709.224710-1-wei.liu@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | kvm: x86/mmu: Don't clear write flooding for direct SP | Lai Jiangshan
Although there is no harm, there is no point in clearing write flooding for a direct SP. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230105100310.6700-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | kvm: x86/mmu: Rename SPTE_TDP_AD_ENABLED_MASK to SPTE_TDP_AD_ENABLED | Lai Jiangshan
SPTE_TDP_AD_ENABLED_MASK, SPTE_TDP_AD_DISABLED_MASK and SPTE_TDP_AD_WRPROT_ONLY_MASK are actual value, not mask. Remove "MASK" from their names. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230105100204.6521-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if present | Paul Durrant
The scaling information in subleaf 1 should match the values set by KVM in the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info) which is shared with the guest, but is not directly available to the VMM. The offset values are not set since a TSC offset is already applied. The TSC frequency should also be set in sub-leaf 2. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86/cpuid: generalize kvm_update_kvm_cpuid_base() and also capture limit | Paul Durrant
A subsequent patch will need to acquire the CPUID leaf range for emulated Xen so explicitly pass the signature of the hypervisor we're interested in to the new function. Also introduce a new kvm_hypervisor_cpuid structure so we can neatly store both the base and limit leaf indices. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230106103600.528-2-pdurrant@amazon.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86: Replace cpu_dirty_logging_count with nr_memslots_dirty_logging | David Matlack
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging. Both fields count the number of memslots that have dirty-logging enabled, with the only difference being that cpu_dirty_logging_count is only incremented when using PML. So while nr_memslots_dirty_logging is not a direct replacement for cpu_dirty_logging_count, it can be combined with enable_pml to get the same information. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
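As a small illustration of "combined with enable_pml to get the same information", the old counter being non-zero corresponds to the predicate below. This is a sketch of the reasoning only; the function name is hypothetical and the two inputs are passed as parameters rather than read from KVM's structures.

```c
#include <stdbool.h>

/*
 * The dropped counter was only incremented when PML was in use, so
 * "old counter != 0" is equivalent to "PML enabled and at least one
 * memslot has dirty logging enabled".
 */
static bool cpu_dirty_logging_active(bool enable_pml,
                                     unsigned int nr_memslots_dirty_logging)
{
    return enable_pml && nr_memslots_dirty_logging > 0;
}
```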
2023-01-24 | KVM: x86: Advertise fast REP string features inherent to the CPU | Jim Mattson
Fast zero-length REP MOVSB, fast short REP STOSB, and fast short REP {CMPSB,SCASB} are inherent features of the processor that cannot be hidden by the hypervisor. When these features are present on the host, enumerate them in KVM_GET_SUPPORTED_CPUID. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220901211811.2883855-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86: fire timer when it is migrated and expired, and in oneshot mode | Li RongQing
When the vCPU is migrated and its timer has expired, KVM _should_ fire the timer ASAP; zeroing the deadline here will cause the timer to fire immediately on the destination. Cc: Sean Christopherson <seanjc@google.com> Cc: Peter Shier <pshier@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20230106040625.8404-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: VMX: Intercept reads to invalid and write-only x2APIC registers | Sean Christopherson
Intercept reads to invalid (non-existent) and write-only x2APIC registers when configuring VMX's MSR bitmaps for x2APIC+APICv. When APICv is fully enabled, Intel hardware doesn't validate the registers on RDMSR and instead blindly retrieves data from the vAPIC page, i.e. it's software's responsibility to intercept reads to non-existent and write-only MSRs. Fixes: 8d14695f9542 ("x86, apicv: add virtual x2apic support") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: VMX: Always intercept accesses to unsupported "extended" x2APIC regs | Sean Christopherson
Don't clear the "read" bits for x2APIC registers above SELF_IPI (APIC regs 0x400 - 0xff0, MSRs 0x840 - 0x8ff). KVM doesn't emulate registers in that space (there are a smattering of AMD-only extensions) and so should intercept reads in order to inject #GP. When APICv is fully enabled, Intel hardware doesn't validate the registers on RDMSR and instead blindly retrieves data from the vAPIC page, i.e. it's software's responsibility to intercept reads to non-existent MSRs. Fixes: 8d14695f9542 ("x86, apicv: add virtual x2apic support") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230107011025.565472-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
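The correspondence between the APIC register range and the MSR range quoted above follows from the x2APIC mapping, in which each 16-byte APIC register maps to one MSR starting at 0x800. A minimal sketch, with illustrative names (0x3f0 is the Self-IPI register offset):

```c
#include <stdbool.h>
#include <stdint.h>

#define X2APIC_MSR_BASE    0x800u
#define APIC_SELF_IPI_REG  0x3f0u   /* last register in the non-extended space */

/* x2APIC MSR index for an xAPIC MMIO register offset. */
static uint32_t x2apic_msr_for_reg(uint32_t reg)
{
    return X2APIC_MSR_BASE + (reg >> 4);
}

/* Registers above Self-IPI fall in the "extended" space described above. */
static bool reg_is_above_self_ipi(uint32_t reg)
{
    return reg > APIC_SELF_IPI_REG;
}
```

Under this mapping, register offsets 0x400 and 0xff0 land on MSRs 0x840 and 0x8ff, which matches the range given in the commit message.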
2023-01-24 | KVM: x86: Split out logic to generate "readable" APIC regs mask to helper | Sean Christopherson
Move the generation of the readable APIC regs bitmask to a standalone helper so that VMX can use the mask for its MSR interception bitmaps. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86: Mark x2APIC DFR reg as non-existent for x2APIC | Sean Christopherson
Mark APIC_DFR as being invalid/non-existent in x2APIC mode instead of handling it as a one-off check in kvm_x2apic_msr_read(). This will allow reusing "valid_reg_mask" to generate VMX's interception bitmaps for x2APIC. Handling DFR in the common read path may also fix the Hyper-V PV MSR interface, if that can coexist with x2APIC. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | KVM: x86: Inject #GP on x2APIC WRMSR that sets reserved bits 63:32 | Sean Christopherson
Reject attempts to set bits 63:32 for 32-bit x2APIC registers, i.e. all x2APIC registers except ICR. Per Intel's SDM: Non-zero writes (by WRMSR instruction) to reserved bits to these registers will raise a general protection fault exception Opportunistically fix a typo in a nearby comment. Reported-by: Marc Orr <marcorr@google.com> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
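The check described here can be sketched as a predicate on the MSR index and the written value. This only models the bits 63:32 rule from the commit message and ignores any other per-register reserved bits. The names are hypothetical; 0x830 is the x2APIC ICR MSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define X2APIC_MSR_ICR 0x830u   /* the only 64-bit x2APIC register */

/* Returns false when a WRMSR should take #GP for setting bits 63:32. */
static bool x2apic_high_bits_ok(uint32_t msr, uint64_t data)
{
    if (msr == X2APIC_MSR_ICR)
        return true;

    return (data >> 32) == 0;
}
```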
2023-01-24 | KVM: x86: Inject #GP if WRMSR sets reserved bits in APIC Self-IPI | Sean Christopherson
Inject a #GP if the guest attempts to set reserved bits in the x2APIC-only Self-IPI register. Bits 7:0 hold the vector, all other bits are reserved. Reported-by: Marc Orr <marcorr@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Venkatesh Srinivas <venkateshs@chromium.org> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20230107011025.565472-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
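Since only bits 7:0 of Self-IPI carry the vector, the validity test reduces to a single mask. A sketch with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns false when a write to Self-IPI sets any bit outside the vector. */
static bool self_ipi_write_ok(uint64_t data)
{
    return (data & ~0xffULL) == 0;
}
```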
2023-01-24 | KVM: x86: remove redundant ret variable | zhang songyi
Return value from apic_get_tmcct() directly instead of taking this in another redundant variable. Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn> Link: https://lore.kernel.org/r/202211231704457807160@zte.com.cn Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 | Merge branch 'kvm-lapic-fix-and-cleanup' into HEAD | Paolo Bonzini
The first half or so of the patches fix semi-urgent, real-world relevant APICv and AVIC bugs. The second half fixes a variety of AVIC and optimized APIC map bugs where KVM doesn't play nice with various edge cases that are architecturally legal(ish), but are unlikely to occur in most real world scenarios. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24 | Merge branch 'kvm-v6.2-rc4-fixes' into HEAD | Paolo Bonzini
ARM:
* Fix the PMCR_EL0 reset value after the PMU rework
* Correctly handle S2 fault triggered by a S1 page table walk by not always classifying it as a write, as this breaks on R/O memslots
* Document why we cannot exit with KVM_EXIT_MMIO when taking a write fault from a S1 PTW on a R/O memslot
* Put the Apple M2 on the naughty list for not being able to correctly implement the vgic SEIS feature, just like the M1 before it
* Reviewer updates: Alex is stepping down, replaced by Zenghui

x86:
* Fix various rare locking issues in Xen emulation and teach lockdep to detect them
* Documentation improvements
* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
2023-01-22 | KVM: x86/vmx: Do not skip segment attributes if unusable bit is set | Hendrik Borghorst
When serializing and deserializing kvm_sregs, attributes of the segment descriptors are stored by user space. For unusable segments, vmx_segment_access_rights skips all attributes and sets them to 0. This means we zero out the DPL (Descriptor Privilege Level) for unusable entries. Unusable segments are - contrary to their name - usable in 64bit mode and are used by guests to for example create a linear map through the NULL selector. VMENTER checks if SS.DPL is correct depending on the CS segment type. For types 9 (Execute Only) and 11 (Execute Read), CS.DPL must be equal to SS.DPL [1]. We have seen real world guests setting CS to a usable segment with DPL=3 and SS to an unusable segment with DPL=3. Once we go through an sregs get/set cycle, SS.DPL turns to 0. This causes the virtual machine to crash reproducibly. This commit changes the attribute logic to always preserve attributes for unusable segments. According to [2] SS.DPL is always saved on VM exits, regardless of the unusable bit so user space applications should have saved the information on serialization correctly. [3] specifies that besides SS.DPL the rest of the attributes of the descriptors are undefined after VM entry if unusable bit is set. So, there should be no harm in setting them all to the previous state. [1] Intel SDM Vol 3C 26.3.1.2 Checks on Guest Segment Registers [2] Intel SDM Vol 3C 27.3.2 Saving Segment Registers and Descriptor-Table Registers [3] Intel SDM Vol 3C 26.3.2.2 Loading Guest Segment Registers and Descriptor-Table Registers Cc: Alexander Graf <graf@amazon.de> Cc: stable@vger.kernel.org Signed-off-by: Hendrik Borghorst <hborghor@amazon.de> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Alexander Graf <graf@amazon.com> Message-Id: <20221114164823.69555-1-hborghor@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
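The effect of always preserving attributes can be sketched with a trimmed-down access-rights packer. The struct and function names are hypothetical and only a few fields are modelled. The bit positions (DPL in bits 6:5, "unusable" in bit 16) follow the VMX access-rights format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down segment state; only what this sketch needs. */
struct seg_state {
    uint8_t type;     /* segment type, 4 bits */
    uint8_t dpl;      /* descriptor privilege level, 2 bits */
    bool unusable;
};

#define AR_DPL_SHIFT     5
#define AR_UNUSABLE_BIT  (1u << 16)

/* Pack attributes unconditionally, then flag the segment as unusable. */
static uint32_t pack_access_rights(const struct seg_state *s)
{
    uint32_t ar = s->type & 0xf;

    ar |= (uint32_t)(s->dpl & 3) << AR_DPL_SHIFT;
    if (s->unusable)
        ar |= AR_UNUSABLE_BIT;
    return ar;
}

static unsigned int access_rights_dpl(uint32_t ar)
{
    return (ar >> AR_DPL_SHIFT) & 3;
}
```

With this shape, an unusable SS with DPL=3 keeps DPL=3 across a get/set cycle, whereas returning only the unusable bit for such segments would reset it to 0.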
2023-01-13 | KVM: x86: Add helpers to recalc physical vs. logical optimized APIC maps | Sean Christopherson
Move the guts of kvm_recalculate_apic_map()'s main loop to two separate helpers to handle recalculating the physical and logical pieces of the optimized map. Having 100+ lines of code in the for-loop makes it hard to understand what is being calculated where. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: x86: Allow APICv APIC ID inhibit to be cleared | Greg Edwards
Legacy kernels prior to commit 4399c03c6780 ("x86/apic: Remove verify_local_APIC()") write the APIC ID of the boot CPU twice to verify a functioning local APIC. This results in APIC acceleration inhibited on these kernels for reason APICV_INHIBIT_REASON_APIC_ID_MODIFIED. Allow the APICV_INHIBIT_REASON_APIC_ID_MODIFIED inhibit reason to be cleared if/when all APICs in xAPIC mode set their APIC ID back to the expected vcpu_id value. Fold the functionality previously in kvm_lapic_xapic_id_updated() into kvm_recalculate_apic_map(), as this allows examining all APICs in one pass. Fixes: 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") Signed-off-by: Greg Edwards <gedwards@ddn.com> Link: https://lore.kernel.org/r/20221117183247.94314-1-gedwards@ddn.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: x86: Track required APICv inhibits with variable, not callback | Sean Christopherson
Track the per-vendor required APICv inhibits with a variable instead of calling into vendor code every time KVM wants to query the set of required inhibits. The required inhibits are a property of the vendor's virtualization architecture, i.e. are 100% static. Using a variable allows the compiler to inline the check, e.g. generate a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking inhibits for performance reasons. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | Revert "KVM: SVM: Do not throw warning when calling avic_vcpu_load on a running vcpu" | Sean Christopherson
Turns out that some warnings exist for good reasons. Restore the warning in avic_vcpu_load() that guards against calling avic_vcpu_load() on a running vCPU now that KVM avoids doing so when switching between x2APIC and xAPIC. The entire point of the WARN is to highlight that KVM should not be reloading an AVIC. Opportunistically convert the WARN_ON() to WARN_ON_ONCE() to avoid spamming the kernel if it does fire. This reverts commit c0caeee65af3944b7b8abbf566e7cc1fae15c775. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-31-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: SVM: Ignore writes to Remote Read Data on AVIC write traps | Sean Christopherson
Drop writes to APIC_RRR, a.k.a. Remote Read Data Register, on AVIC unaccelerated write traps. The register is read-only and isn't emulated by KVM. Sending the register through kvm_apic_write_nodecode() will result in screaming when x2APIC is enabled due to the unexpected failure to retrieve the MSR (KVM expects that only "legal" accesses will trap). Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: SVM: Handle multiple logical targets in AVIC kick fastpath | Sean Christopherson
Iterate over all target logical IDs in the AVIC kick fastpath instead of bailing if there is more than one target. Now that KVM inhibits AVIC if vCPUs aren't mapped 1:1 with logical IDs, each bit in the destination is guaranteed to match to at most one vCPU, i.e. iterating over the bitmap is guaranteed to kick each valid target exactly once. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
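The iteration described above can be sketched as a walk over the set bits of a destination bitmap. This assumes an 8-bit flat-mode bitmap and a hypothetical helper that records bit indices instead of kicking vCPUs. It is not the code in avic_kick_target_vcpus_fast().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the index of every set bit in the destination bitmap.
 * If each bit maps to at most one vCPU, visiting each set bit once
 * reaches each valid target exactly once.
 */
static size_t logical_targets(uint8_t dest_bitmap, unsigned int targets[8])
{
    size_t n = 0;

    for (unsigned int bit = 0; bit < 8; bit++) {
        if (dest_bitmap & (1u << bit))
            targets[n++] = bit;
    }
    return n;
}
```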
2023-01-13 | KVM: SVM: Require logical ID to be power-of-2 for AVIC entry | Sean Christopherson
Do not modify AVIC's logical ID table if the logical ID portion of the LDR is not a power-of-2, i.e. if the LDR has multiple bits set. Taking only the first bit means that KVM will fail to match MDAs that intersect with "higher" bits in the "ID". The "ID" acts as a bitmap, but is referred to as an ID because there's an implicit, unenforced "requirement" that software only set one bit. This edge case is arguably out-of-spec behavior, but KVM cleanly handles it in all other cases, e.g. the optimized logical map (and AVIC!) is also disabled in this scenario. Refactor the code to consolidate the checks, and so that the code looks more like avic_kick_target_vcpus_fast(). Fixes: 18f40c53e10f ("svm: Add VMEXIT handlers for AVIC") Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
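The power-of-2 requirement amounts to "exactly one bit set". A minimal sketch, assuming xAPIC flat mode where the logical ID occupies bits 31:24 of the LDR. The function names are illustrative and cluster mode is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Logical ID field of an xAPIC LDR value (bits 31:24). */
static uint8_t ldr_logical_id(uint32_t ldr)
{
    return (uint8_t)(ldr >> 24);
}

/* True only if exactly one bit is set in the logical ID. */
static bool logical_id_is_single_bit(uint32_t ldr)
{
    uint8_t id = ldr_logical_id(ldr);

    return id != 0 && (id & (id - 1)) == 0;
}
```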
2023-01-13 | KVM: SVM: Update svm->ldr_reg cache even if LDR is "bad" | Sean Christopherson
Update SVM's cache of the LDR even if the new value is "bad". Leaving stale information in the cache can result in KVM missing updates and/or invalidating the wrong entry, e.g. if avic_invalidate_logical_id_entry() is triggered after a different vCPU has "claimed" the old LDR. Fixes: 18f40c53e10f ("svm: Add VMEXIT handlers for AVIC") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: SVM: Always update local APIC on writes to logical dest register | Sean Christopherson
Update the vCPU's local (virtual) APIC on LDR writes even if the write "fails". The APIC needs to recalc the optimized logical map even if the LDR is invalid or zero, e.g. if the guest clears its LDR, the optimized map will be left as is and the vCPU will receive interrupts using its old LDR. Fixes: 18f40c53e10f ("svm: Add VMEXIT handlers for AVIC") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: SVM: Inhibit AVIC if vCPUs are aliased in logical mode | Sean Christopherson
Inhibit SVM's AVIC if multiple vCPUs are aliased to the same logical ID. Architecturally, all CPUs whose logical ID matches the MDA are supposed to receive the interrupt; overwriting existing entries in AVIC's logical=>physical map can result in missed IPIs. Fixes: 18f40c53e10f ("svm: Add VMEXIT handlers for AVIC") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 | KVM: x86: Inhibit APICv/AVIC if the optimized physical map is disabled | Sean Christopherson
Inhibit APICv/AVIC if the optimized physical map is disabled so that KVM provides consistent APIC behavior if xAPIC IDs are aliased due to vcpu_id being truncated and the x2APIC hotplug hack isn't enabled. If the hotplug hack is disabled, events that are emulated by KVM will follow architectural behavior (all matching vCPUs receive events, even if the "match" is due to truncation), whereas APICv and AVIC will deliver events only to the first matching vCPU, i.e. the vCPU that matches without truncation. Note, the "extra" inhibit is needed because KVM deliberately ignores mismatches due to truncation when applying the APIC_ID_MODIFIED inhibit so that large VMs (>255 vCPUs) can run with APICv/AVIC. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDsSean Christopherson
Apply KVM's hotplug hack if and only if userspace has enabled 32-bit IDs for x2APIC. If 32-bit IDs are not enabled, disable the optimized map to honor x86 architectural behavior if multiple vCPUs share a physical APIC ID. As called out in the changelog that added the hack, all CPUs whose (possibly truncated) APIC ID matches the target are supposed to receive the IPI. KVM intentionally differs from real hardware, because real hardware (Knights Landing) does just "x2apic_id & 0xff" to decide whether to accept the interrupt in xAPIC mode and it can deliver one interrupt to more than one physical destination, e.g. 0x123 to 0x123 and 0x23. Applying the hack even when x2APIC is not fully enabled means KVM doesn't correctly handle scenarios where the guest has aliased xAPIC IDs across multiple vCPUs, as only the vCPU with the lowest vCPU ID will receive any interrupts. It's extremely unlikely any real world guest aliases APIC IDs, or even modifies APIC IDs, but KVM's behavior is arbitrary, e.g. the lowest vCPU ID "wins" regardless of which vCPU is "aliasing" and which vCPU is "normal". Furthermore, the hack is _not_ guaranteed to work! The hack works if and only if the optimized APIC map is successfully allocated. If the map allocation fails (unlikely), KVM will fall back to its unoptimized behavior, which _does_ honor the architectural behavior. Pivot on 32-bit x2APIC IDs being enabled as that is required to take advantage of the hotplug hack (see kvm_apic_state_fixup()), i.e. won't break existing setups unless they are way, way off in the weeds. Add an entry in KVM's errata to document the hack. Alternatively, KVM could provide an actual x2APIC quirk and document the hack that way, but there's unlikely to ever be a use case for disabling the quirk. Go the errata route to avoid having to validate a quirk no one cares about.
Fixes: 5bd5db385b3e ("KVM: x86: allow hotplug of VCPU with APIC ID over 0xff") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
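The truncation described in the changelog above can be illustrated with a small standalone sketch. It is not KVM code: count_xapic_acceptors() is a hypothetical helper that models only the "x2apic_id & 0xff" acceptance test quoted in the message, assuming an 8-bit xAPIC destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a KVM function: count how many CPUs would accept
 * an xAPIC-mode interrupt sent to the 8-bit destination @dest when each CPU
 * compares only the low 8 bits of its (possibly wider) APIC ID.
 */
static size_t count_xapic_acceptors(const uint32_t *ids, size_t n, uint8_t dest)
{
	size_t i, hits = 0;

	assert(ids != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* Acceptance test from the changelog: "x2apic_id & 0xff". */
		if ((ids[i] & 0xff) == dest)
			hits++;
	}
	return hits;
}
```

With IDs 0x23 and 0x123 both present, a destination of 0x23 has two acceptors, which is the aliasing case where an optimized map holding one entry per ID would drop a recipient.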
2023-01-13KVM: x86: Disable APIC logical map if vCPUs are aliased in logical modeSean Christopherson
Disable the optimized APIC logical map if multiple vCPUs are aliased to the same logical ID. Architecturally, all CPUs whose logical ID matches the MDA are supposed to receive the interrupt; overwriting existing map entries can result in missed IPIs. Fixes: 1e08ec4a130e ("KVM: optimize apic interrupt delivery") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
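A minimal sketch of the "disable instead of overwrite" idea from the changelog above, under simplifying assumptions: struct toy_logical_map, toy_map_init() and toy_map_add() are illustrative names that do not exist in KVM, and the map models only 8-bit, single-bit logical IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_SLOTS 8		/* one slot per bit of an 8-bit logical ID */
#define TOY_NO_VCPU  (-1)

/* Illustrative stand-in for an optimized logical map; not KVM's structure. */
struct toy_logical_map {
	bool disabled;			/* set once an alias is detected */
	int slot[TOY_NR_SLOTS];		/* vCPU index per logical ID bit */
};

static void toy_map_init(struct toy_logical_map *map)
{
	int i;

	map->disabled = false;
	for (i = 0; i < TOY_NR_SLOTS; i++)
		map->slot[i] = TOY_NO_VCPU;
}

/*
 * Record @vcpu under @logical_id. If another vCPU already owns the slot,
 * mark the whole map as disabled rather than overwriting the entry, so a
 * caller can fall back to checking every vCPU.
 */
static void toy_map_add(struct toy_logical_map *map, int vcpu, uint8_t logical_id)
{
	int bit = 0;

	/* This sketch only handles IDs with exactly one bit set. */
	assert(logical_id != 0 && (logical_id & (logical_id - 1)) == 0);

	if (map->disabled)
		return;

	while (!(logical_id & (1u << bit)))
		bit++;

	if (map->slot[bit] != TOY_NO_VCPU) {
		map->disabled = true;	/* aliased logical ID */
		return;
	}
	map->slot[bit] = vcpu;
}
```

Overwriting the slot would leave only the last vCPU reachable through the map; flagging the map as unusable keeps every aliased vCPU reachable through the slow path.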
2023-01-13KVM: x86: Disable APIC logical map if logical ID covers multiple MDAsSean Christopherson
Disable the optimized APIC logical map if a logical ID covers multiple MDAs, i.e. if a vCPU has multiple bits set in its ID. In logical mode, events match if "ID & MDA != 0", i.e. creating an entry for only the first bit can cause interrupts to be missed. Note, creating an entry for every bit is also wrong as KVM would generate IPIs for every matching bit. It would be possible to teach KVM to play nice with this edge case, but it is very much an edge case and probably not used in any real world OS, i.e. it's not worth optimizing. Fixes: 1e08ec4a130e ("KVM: optimize apic interrupt delivery") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
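The matching rule quoted above can be written out as a short sketch. The helper names are hypothetical and the IDs are assumed to be 8 bits wide. Note that the changelog's "ID & MDA != 0" is shorthand; in C the comparison binds tighter than the bitwise AND, so the parentheses are required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Logical-mode match: any overlapping bit between the ID and the MDA. */
static bool logical_match(uint8_t id, uint8_t mda)
{
	return (id & mda) != 0;
}

/* True if exactly one bit is set, i.e. the ID fits a one-entry-per-bit map. */
static bool has_single_bit(uint8_t id)
{
	return id != 0 && (id & (id - 1)) == 0;
}

/* Map slot for a single-bit ID (index of the set bit). */
static unsigned int slot_for_id(uint8_t id)
{
	unsigned int bit = 0;

	assert(has_single_bit(id));

	while (!(id & (1u << bit)))
		bit++;
	return bit;
}
```

An ID of 0x03 matches both MDA 0x01 and MDA 0x02, so storing it under only its first bit would miss interrupts sent to 0x02, while has_single_bit(0x03) being false is the condition under which this sketch would give up on the optimized map.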
2023-01-13KVM: x86: Skip redundant x2APIC logical mode optimized cluster setupSean Christopherson
Skip the optimized cluster[] setup for x2APIC logical mode, as KVM reuses the optimized map's phys_map[] and doesn't actually need to insert the target apic into the cluster[]. The LDR is derived from the x2APIC ID, and both are read-only in KVM, thus the vCPU's cluster[ldr] is guaranteed to be the same entry as the vCPU's phys_map[x2apic_id] entry. Skipping the unnecessary setup will allow a future fix for aliased xAPIC logical IDs to simply require that cluster[ldr] is non-NULL, i.e. won't have to special case x2APIC. Alternatively, the future check could allow "cluster[ldr] == apic", but that ends up being terribly confusing because cluster[ldr] is only set at the very end, i.e. it's only possible due to x2APIC's shenanigans. Another alternative would be to send x2APIC down a separate path _after_ the calculation and then assert all of the above, but the resulting code is rather messy, and it's arguably unnecessary since asserting that the actual LDR matches the expected LDR means that simply testing that interrupts are delivered correctly provides the same guarantees. Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
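The claim that the LDR is derived from the x2APIC ID can be sketched as follows. The forward formula follows the architectural x2APIC definition (cluster ID in bits 31:16, a one-hot bit selected by ID[3:0] in bits 15:0); the function names and the inverse helper are illustrative assumptions, not KVM's code.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x2APIC derivation: LDR = (ID[19:4] << 16) | (1 << ID[3:0]). */
static uint32_t x2apic_ldr_from_id(uint32_t id)
{
	return ((id >> 4) << 16) | (1u << (id & 0xf));
}

/*
 * Hypothetical inverse: recover the x2APIC ID from a derived LDR. Because
 * the mapping is one-to-one, a lookup by LDR and a lookup by ID land on the
 * same vCPU.
 */
static uint32_t x2apic_id_from_ldr(uint32_t ldr)
{
	uint32_t low = ldr & 0xffff;
	uint32_t bit = 0;

	/* A derived LDR always has exactly one bit set in bits 15:0. */
	assert(low != 0 && (low & (low - 1)) == 0);

	while (!(low & (1u << bit)))
		bit++;
	return ((ldr >> 16) << 4) | bit;
}
```

For example, ID 0x23 yields LDR 0x00020008 (cluster 2, bit 3), and converting that LDR back returns 0x23, which is why a separate cluster[] entry carries no extra information in this mode.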
2023-01-13KVM: x86: Explicitly track all possibilities for APIC map's logical modesSean Christopherson
Track all possibilities for the optimized APIC map's logical modes instead of overloading the pseudo-bitmap and treating any "unknown" value as "invalid". As documented by the now-stale comment above the mode values, the values did have meaning when the optimized map was originally added. That dependent logic was removed by commit e45115b62f9a ("KVM: x86: use physical LAPIC array for logical x2APIC"), but the obfuscated behavior and its comment were left behind. Opportunistically rename "mode" to "logical_mode", partly to make it clear that the "disabled" case applies only to the logical map, but also to prove that there is no lurking code that expects "mode" to be a bitmap. Functionally, this is a glorified nop. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20230106011306.85230-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13KVM: x86: Explicitly skip optimized logical map setup if vCPU's LDR==0Sean Christopherson
Explicitly skip the optimized map setup if the vCPU's LDR is '0', i.e. if the vCPU will never respond to logical mode interrupts. KVM already skips setup in this case, but relies on kvm_apic_map_get_logical_dest() to generate mask==0. KVM still needs the mask==0 check as a non-zero LDR can yield mask==0 depending on the mode, but explicitly handling the LDR will make it simpler to clean up the logical mode tracking in the future. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>