summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2022-04-02KVM: x86: Use gfn_to_pfn_cache for pv_timeDavid Woodhouse
Add a new kvm_setup_guest_pvclock() which parallels the existing kvm_setup_pvclock_page(). The latter will be removed once we convert all users to the gfn_to_pfn_cache version. Using the new cache, we can potentially let kvm_set_guest_paused() set the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/xen: Use gfn_to_pfn_cache for runstate areaDavid Woodhouse
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Allow userspace to opt out of hypercall patchingOliver Upton
KVM handles the VMCALL/VMMCALL instructions very strangely. Even though both of these instructions really should #UD when executed on the wrong vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the guest's instruction with the appropriate instruction for the vendor. Nonetheless, older guest kernels without commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only") do not patch in the appropriate instruction using alternatives, likely motivating KVM's intervention. Add a quirk allowing userspace to opt out of hypercall patching. If the quirk is disabled, KVM synthesizes a #UD in the guest. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220316005538.2282772-2-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: SVM: fix tsc scaling when the host doesn't support itMaxim Levitsky
It was decided that when TSC scaling is not supported, the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0' value. However in this case kvm_max_tsc_scaling_ratio is not set, which breaks various assumptions. Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of host support. For consistency, do the same for VMX. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsrHou Wenlong
If MSR access is rejected by MSR filtering, kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED, and the return value is only handled well for rdmsr/wrmsr. However, some instruction emulation and state transition also use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger some unexpected results if MSR access is rejected, E.g. RDPID emulation would inject a #UD but RDPID wouldn't cause a exit when RDPID is supported in hardware and ENABLE_RDTSCP is set. And it would also cause failure when load MSR at nested entry/exit. Since msr filtering is based on MSR bitmap, it is better to only do MSR filtering for rdmsr/wrmsr. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/emulator: Emulate RDPID only if it is enabled in guestHou Wenlong
When RDTSCP is supported but RDPID is not supported in host, RDPID emulation is available. However, __kvm_get_msr() would only fail when RDTSCP/RDPID both are disabled in guest, so the emulator wouldn't inject a #UD when RDPID is disabled but RDTSCP is enabled in guest. Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Trace all APICv inhibit changes and capture overall statusSean Christopherson
Trace all APICv inhibit changes instead of just those that result in APICv being (un)inhibited, and log the current state. Debugging why APICv isn't working is frustrating as it's hard to see why APICv is still inhibited, and logging only the first inhibition means unnecessary onion peeling. Opportunistically drop the export of the tracepoint, it is not and should not be used by vendor code due to the need to serialize toggling via apicv_update_lock. Note, using the common flow means kvm_apicv_init() switched from atomic to non-atomic bitwise operations. The VM is unreachable at init, so non-atomic is perfectly ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Add wrappers for setting/clearing APICv inhibitsSean Christopherson
Add set/clear wrappers for toggling APICv inhibits to make the call sites more readable, and opportunistically rename the inner helpers to align with the new wrappers and to make them more readable as well. Invert the flag from "activate" to "set"; activate is painfully ambiguous as it's not obvious if the inhibit is being activated, or if APICv is being activated, in which case the inhibit is being deactivated. For the functions that take @set, swap the order of the inhibit reason and @set so that the call sites are visually similar to those that bounce through the wrapper. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Make APICv inhibit reasons an enum and cleanup namingSean Christopherson
Use an enum for the APICv inhibit reasons, there is no meaning behind their values and they most definitely are not "unsigned longs". Rename the various params to "reason" for consistency and clarity (inhibit may be confused as a command, i.e. inhibit APICv, instead of the reason that is getting toggled/checked). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Handle implicit supervisor access with SMAPLai Jiangshan
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Change the type of access u32 to u64Lai Jiangshan
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: MMU: propagate alloc_workqueue failurePaolo Bonzini
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Proper emulation of the OSLock feature of the debug architecture - Scalibility improvements for the MMU lock when dirty logging is on - New VMID allocator, which will eventually help with SVA in VMs - Better support for PMUs in heterogenous systems - PSCI 1.1 support, enabling support for SYSTEM_RESET2 - Implement CONFIG_DEBUG_LIST at EL2 - Make CONFIG_ARM64_ERRATUM_2077057 default y - Reduce the overhead of VM exit when no interrupt is pending - Remove traces of 32bit ARM host support from the documentation - Updated vgic selftests - Various cleanups, doc updates and spelling fixes RISC-V: - Prevent KVM_COMPAT from being selected - Optimize __kvm_riscv_switch_to() implementation - RISC-V SBI v0.3 support s390: - memop selftest - fix SCK locking - adapter interruptions virtualization for secure guests - add Claudio Imbrenda as maintainer - first step to do proper storage key checking x86: - Continue switching kvm_x86_ops to static_call(); introduce static_call_cond() and __static_call_ret0 when applicable. - Cleanup unused arguments in several functions - Synthesize AMD 0x80000021 leaf - Fixes and optimization for Hyper-V sparse-bank hypercalls - Implement Hyper-V's enlightened MSR bitmap for nested SVM - Remove MMU auditing - Eager splitting of page tables (new aka "TDP" MMU only) when dirty page tracking is enabled - Cleanup the implementation of the guest PGD cache - Preparation for the implementation of Intel IPI virtualization - Fix some segment descriptor checks in the emulator - Allow AMD AVIC support on systems with physical APIC ID above 255 - Better API to disable virtualization quirks - Fixes and optimizations for the zapping of page tables: - Zap roots in two passes, avoiding RCU read-side critical sections that last too long for very large guests backed by 4 KiB SPTEs. - Zap invalid and defunct roots asynchronously via concurrency-managed work queue. - Allowing yielding when zapping TDP MMU roots in response to the root's last reference being put. - Batch more TLB flushes with an RCU trick. Whoever frees the paging structure now holds RCU as a proxy for all vCPUs running in the guest, i.e. to prolongs the grace period on their behalf. It then kicks the the vCPUs out of guest mode before doing rcu_read_unlock(). Generic: - Introduce __vcalloc and use it for very large allocations that need memcg accounting" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) KVM: use kvcalloc for array allocations KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 kvm: x86: Require const tsc for RT KVM: x86: synthesize CPUID leaf 0x80000021h if useful KVM: x86: add support for CPUID leaf 0x80000021 KVM: x86: do not use KVM_X86_OP_OPTIONAL_RET0 for get_mt_mask Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" kvm: x86/mmu: Flush TLB before zap_gfn_range releases RCU KVM: arm64: fix typos in comments KVM: arm64: Generalise VM features into a set of flags KVM: s390: selftests: Add error memop tests KVM: s390: selftests: Add more copy memop tests KVM: s390: selftests: Add named stages for memop test KVM: s390: selftests: Add macro as abstraction for MEM_OP KVM: s390: selftests: Split memop tests KVM: s390x: fix SCK locking RISC-V: KVM: Implement SBI HSM suspend call RISC-V: KVM: Add common kvm_riscv_vcpu_wfi() function RISC-V: Add SBI HSM suspend related defines RISC-V: KVM: Implement SBI v0.3 SRST extension ...
2022-03-22Merge tag 'sched-core-2022-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Cleanups for SCHED_DEADLINE - Tracing updates/fixes - CPU Accounting fixes - First wave of changes to optimize the overhead of the scheduler build, from the fast-headers tree - including placeholder *_api.h headers for later header split-ups. - Preempt-dynamic using static_branch() for ARM64 - Isolation housekeeping mask rework; preperatory for further changes - NUMA-balancing: deal with CPU-less nodes - NUMA-balancing: tune systems that have multiple LLC cache domains per node (eg. AMD) - Updates to RSEQ UAPI in preparation for glibc usage - Lots of RSEQ/selftests, for same - Add Suren as PSI co-maintainer * tag 'sched-core-2022-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits) sched/headers: ARM needs asm/paravirt_api_clock.h too sched/numa: Fix boot crash on arm64 systems headers/prep: Fix header to build standalone: <linux/psi.h> sched/headers: Only include <linux/entry-common.h> when CONFIG_GENERIC_ENTRY=y cgroup: Fix suspicious rcu_dereference_check() usage warning sched/preempt: Tell about PREEMPT_DYNAMIC on kernel headers sched/topology: Remove redundant variable and fix incorrect type in build_sched_domains sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity() sched/deadline,rt: Remove unused functions for !CONFIG_SMP sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistently sched/deadline: Merge dl_task_can_attach() and dl_cpu_busy() sched/deadline: Move bandwidth mgmt and reclaim functions into sched class source file sched/deadline: Remove unused def_dl_bandwidth sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE sched/tracing: Don't re-read p->state when emitting sched_switch event sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race sched/cpuacct: Remove redundant RCU read lock sched/cpuacct: Optimize away RCU read lock sched/cpuacct: Fix charge percpu cpuusage sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies ...
2022-03-22Merge tag 'perf-core-2022-03-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf event updates from Ingo Molnar: - Fix address filtering for Intel/PT,ARM/CoreSight - Enable Intel/PEBS format 5 - Allow more fixed-function counters for x86 - Intel/PT: Enable not recording Taken-Not-Taken packets - Add a few branch-types * tag 'perf-core-2022-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Fix the build on !CONFIG_PHYS_ADDR_T_64BIT perf: Add irq and exception return branch types perf/x86/intel/uncore: Make uncore_discovery clean for 64 bit addresses perf/x86/intel/pt: Add a capability and config bit for disabling TNTs perf/x86/intel/pt: Add a capability and config bit for event tracing perf/x86/intel: Increase max number of the fixed counters KVM: x86: use the KVM side max supported fixed counter perf/x86/intel: Enable PEBS format 5 perf/core: Allow kernel address filter when not filtering the kernel perf/x86/intel/pt: Fix address filter config for 32-bit kernel perf/core: Fix address filter parser for multiple filters x86: Share definition of __is_canonical_address() perf/x86/intel/pt: Relax address filter validation
2022-03-21KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2Oliver Upton
KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not advertise the set of quirks which may be disabled to userspace, so it is impossible to predict the behavior of KVM. Worse yet, KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning it fails to reject attempts to set invalid quirk bits. The only valid workaround for the quirky quirks API is to add a new CAP. Actually advertise the set of quirks that can be disabled to userspace so it can predict KVM's behavior. Reject values for cap->args[0] that contain invalid bits. Finally, add documentation for the new capability and describe the existing quirks. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220301060351.442881-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-21kvm: x86: Require const tsc for RTThomas Gleixner
Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-15Merge tag 'v5.17-rc8' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-08KVM: use __vcalloc for very large allocationsPaolo Bonzini
Allocations whose size is related to the memslot size can be arbitrarily large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy" sizes that fit in 32 bits. Cc: stable@vger.kernel.org Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls") Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-04Merge branch 'kvm-bugfixes' into HEADPaolo Bonzini
Merge bugfixes from 5.17 before merging more tricky work.
2022-03-02KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_runPaolo Bonzini
kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two places, namely vcpu_run and post_kvm_run_save, and a third is actually needed around the call to vcpu->arch.complete_userspace_io to avoid the following splat: WARNING: suspicious RCU usage arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by CPU 28/KVM/370841: #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm] Call Trace: <TASK> dump_stack_lvl+0x59/0x73 reprogram_fixed_counter+0x15d/0x1a0 [kvm] kvm_pmu_trigger_event+0x1a3/0x260 [kvm] ? free_moved_vector+0x1b4/0x1e0 complete_fast_pio_in+0x8a/0xd0 [kvm] This splat is not at all unexpected, since complete_userspace_io callbacks can execute similar code to vmexits. For example, SVM with nrips=false will call into the emulator from svm_skip_emulated_instruction(). While it's tempting to never acquire kvm->srcu for an uninitialized vCPU, practically speaking there's no penalty to acquiring kvm->srcu "early" as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU. On the other hand, seemingly innocuous helpers like kvm_apic_accept_events() and sync_regs() can theoretically reach code that might access SRCU-protected data structures, e.g. sync_regs() can trigger forced existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events(). Reported-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-01KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zappedSean Christopherson
Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-01KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE changeSean Christopherson
Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call when the guest's CR4.PCIDE changes. This will allow tweaking the logic of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as obsolete, needs to unconditionally unload the entire MMU, _and_ affects only the current vCPU. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-01KVM: x86: Make kvm_lapic_set_reg() a "private" xAPIC helperSean Christopherson
Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code to go through proper helpers when modification the vAPIC state. Keep the read helper visible to outsiders for now, refactoring KVM to hide it too is possible, it will just take more work to do so. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-01KVM: x86: WARN if KVM emulates an IPI without clearing the BUSY flagSean Christopherson
WARN if KVM emulates an IPI without clearing the BUSY flag, failure to do so could hang the guest if it waits for the IPI be sent. Opportunistically use APIC_ICR_BUSY macro instead of open coding the magic number, and add a comment to clarify why kvm_recalculate_apic_map() is unconditionally invoked (it's really, really confusing for IPIs due to the existence of fast paths that don't trigger a potential recalc). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220204214205.3306634-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-01KVM: x86: flush TLB separately from MMU resetPaolo Bonzini
For both CR0 and CR4, disassociate the TLB flush logic from the MMU role logic. Instead of relying on kvm_mmu_reset_context() being a superset of various TLB flushes (which is not necessarily going to be the case in the future), always call it if the role changes but also set the various TLB flush requests according to what is in the manual. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: do not pass vcpu to root freeing functionsPaolo Bonzini
These functions only operate on a given MMU, of which there is more than one in a vCPU (we care about two, because the third does not have any roots and is only used to walk guest page tables). They do need a struct kvm in order to lock the mmu_lock, but they do not needed anything else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: use struct kvm_mmu_root_info for mmu->rootPaolo Bonzini
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: do not deliver asynchronous page faults if CR0.PG=0Paolo Bonzini
Enabling async page faults is nonsensical if paging is disabled, but it is allowed because CR0.PG=0 does not clear the async page fault MSR. Just ignore them and only use the artificial halt state, similar to what happens in guest mode if async #PF vmexits are disabled. Given the increasingly complex logic, and the nicer code if the new "if" is placed last, opportunistically change the "||" into a chain of "if (...) return false" statements. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: Reinitialize context if host userspace toggles EFER.LMEPaolo Bonzini
While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore EFER.NX is the only bit that can affect the MMU role. However, set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1. In that case, the MMU has to be reset. Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: Provide per VM capability for disabling PMU virtualizationDavid Dunn
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of settings/features to allow userspace to configure PMU virtualization on a per-VM basis. For now, support a single flag, KVM_PMU_CAP_DISABLE, to allow disabling PMU virtualization for a VM even when KVM is configured with enable_pmu=true a module level. To keep KVM simple, disallow changing VM's PMU configuration after vCPUs have been created. Signed-off-by: David Dunn <daviddunn@google.com> Message-Id: <20220223225743.2703915-2-daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-21Merge tag 'v5.17-rc5' into sched/core, to resolve conflictsIngo Molnar
New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-18KVM: x86: allow defining return-0 static callsPaolo Bonzini
A few vendor callbacks are only used by VMX, but they return an integer or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is NULL in struct kvm_x86_ops, it will be changed to __static_call_return0 when updating static calls. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: make several APIC virtualization callbacks optionalPaolo Bonzini
All their invocations are conditional on vcpu->arch.apicv_active, meaning that they need not be implemented by vendor code: even though at the moment both vendors implement APIC virtualization, all of them can be optional. In fact SVM does not need many of them, and their implementation can be deleted now. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: remove KVM_X86_OP_NULL and mark optional kvm_x86_opsPaolo Bonzini
The original use of KVM_X86_OP_NULL, which was to mark calls that do not follow a specific naming convention, is not in use anymore. Instead, let's mark calls that are optional because they are always invoked within conditionals or with static_call_cond. Those that are _not_, i.e. those that are defined with KVM_X86_OP, must be defined by both vendor modules or some kind of NULL pointer dereference is bound to happen at runtime. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: use static_call_cond for optional callbacksPaolo Bonzini
SVM implements neither update_emulated_instruction nor set_apic_access_page_addr. Remove an "if" by calling them with static_call_cond(). Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: return 1 unconditionally for availability of KVM_CAP_VAPICPaolo Bonzini
The two ioctls used to implement userspace-accelerated TPR, KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available even if hardware-accelerated TPR can be used. So there is no reason not to report KVM_CAP_VAPIC. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18x86/kvm: Fix compilation warning in non-x86_64 buildsLeonardo Bras
On non-x86_64 builds, helpers gtod_is_based_on_tsc() and kvm_guest_supported_xfd() are defined but never used. Because these are static inline but are in a .c file, some compilers do warn for them with -Wunused-function, which becomes an error if -Werror is present. Add #ifdef so they are only defined in x86_64 builds. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220218034100.115702-1-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0Leonardo Bras
kvm_vcpu_arch currently contains the guest supported features in both guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field. Currently both fields are set to the same value in kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that. Since it's not good to keep duplicated data, remove guest_supported_xcr0. To keep the code more readable, introduce kvm_guest_supported_xcr() and kvm_guest_supported_xfd() to replace the previous usages of guest_supported_xcr0. Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220217053028.96432-3-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup modeAnton Romanov
If vcpu has tsc_always_catchup set each request updates pvclock data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on host's side and do hypercall inside pvclock_read_retry loop leading to infinite loop in such situation. v3: Removed warn Changed return code to KVM_EFAULT v2: Added warn Signed-off-by: Anton Romanov <romanton@google.com> Message-Id: <20220216182653.506850-1-romanton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17KVM: x86: Add KVM_CAP_ENABLE_CAP to x86Aaron Lewis
Follow the precedent set by other architectures that support the VCPU ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP. This way, userspace can ensure that KVM_ENABLE_CAP is available on a vcpu before using it. Fixes: 5c919412fe61 ("kvm/x86: Hyper-V synthetic interrupt controller") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220214212950.1776943-1-aaronlewis@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-16sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-14Merge remote-tracking branch 'kvm/master' into HEADPaolo Bonzini
Merge bugfix patches from Linux 5.17-rc.
2022-02-11KVM: SVM: fix race between interrupt delivery and AVIC inhibitionMaxim Levitsky
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got inhibited, it might read a stale value of vcpu->arch.apicv_active which can lead to the target vCPU not noticing the interrupt. To fix this use load-acquire/store-release so that, if the target vCPU is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the AVIC. If AVIC has been disabled in the meanwhile, proceed with the KVM_REQ_EVENT-based delivery. Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and in fact it can be handled in exactly the same way; the only difference lies in who has set IRR, whether svm_deliver_interrupt or the processor. Therefore, svm_complete_interrupt_delivery can be used to fix incomplete IPI vmexits as well. Co-developed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOGDavid Matlack
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is ↵David Matlack
enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooksSean Christopherson
Use slightly more verbose names for the so called "memory encrypt", a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current super short kvm_x86_ops names and SVM's more verbose, but non-conforming names. This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP() to fill svm_x86_ops. Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better reflect its true nature, as it really is a full fledged ioctl() of its own. Ideally, the hook would be named confidential_vm_ioctl() or so, as the ioctl() is a gateway to more than just memory encryption, and because its underlying purpose to support Confidential VMs, which can be provided without memory encryption, e.g. if the TCB of the guest includes the host kernel but not host userspace, or by isolation in hardware without encrypting memory. But, diverging from KVM_MEMORY_ENCRYPT_OP even further is undeseriable, and short of creating alises for all related ioctl()s, which introduces a different flavor of divergence, KVM is stuck with the nomenclature. Defer renaming SVM's functions to a future commit as there are additional changes needed to make SVM fully conforming and to match reality (looking at you, svm_vm_copy_asid_from()). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86: Move get_cs_db_l_bits() helper to SVMSean Christopherson
Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a superfluous export from KVM x86. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86: Use static_call() for copy/move encryption context ioctls()Sean Christopherson
Define and use static_call()s for .vm_{copy,move}_enc_context_from(), mostly so that the op is defined in kvm-x86-ops.h. This will allow using KVM_X86_OP in vendor code to wire up the implementation. Any performance gains eeked out by using static_call() is a happy bonus and not the primary motiviation. Opportunistically refactor the code to reduce indentation and keep line lengths reasonable, and to be consistent when wrapping versus running a bit over the 80 char soft limit. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86: Unexport kvm_x86_opsSean Christopherson
Drop the export of kvm_x86_ops now it is no longer referenced by SVM or VMX. Disallowing access to kvm_x86_ops is very desirable as it prevents vendor code from incorrectly modifying hooks after they have been set by kvm_arch_hardware_setup(), and more importantly after each function's associated static_call key has been updated. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>