summaryrefslogtreecommitdiff
path: root/virt/kvm/kvm_main.c
AgeCommit message (Collapse)Author
2025-07-17KVM: Allow CPU to reschedule while setting per-page memory attributesLiam Merwick
commit 47bb584237cc285e3a860b70c01f7bda9dcfb05b upstream. When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+), the host can experience CPU soft lockups when running an operation in kvm_vm_set_mem_attributes() to set memory attributes on the whole range of guest memory. watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372] CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary) Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024 RIP: 0010:xas_create+0x78/0x1f0 Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87 RSP: 0018:ffffad890a34b940 EFLAGS: 00000286 RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000 RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868 R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868 FS: 00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0 PKRU: 55555554 Call Trace: <TASK> xas_store+0x58/0x630 __xa_store+0xa5/0x130 xa_store+0x2c/0x50 kvm_vm_set_mem_attributes+0x343/0x710 [kvm] kvm_vm_ioctl+0x796/0xab0 [kvm] __x64_sys_ioctl+0xa3/0xd0 do_syscall_64+0x8c/0x7a0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f5578d031bb Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000 R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120 R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0 While looping through the range of memory setting the attributes, call cond_resched() to give the scheduler a chance to run a higher priority task on the runqueue if necessary and avoid staying in kernel mode long enough to trigger the lockup. Fixes: 5a475554db1e ("KVM: Introduce per-page memory attributes") Cc: stable@vger.kernel.org # 6.12.x Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22KVM: Add member to struct kvm_gfn_range to indicate private/sharedIsaku Yamahata
[ Upstream commit dca6c88532322830d5d92486467fcc91b67a9ad8 ] Add new members to strut kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set them appropriately. TDX utilizes two GPA aliases for the same memslots, one for memory that is for private memory and one that is for shared. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory. Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases so they can only target the aliases appropriate for their specific operation. There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered. Several variations of having separate bools defined such that the default behavior was to process both aliases. They either allowed nonsensical configurations, or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value. Set target alias with enum appropriately for these MMU operations: - For KVM's mmu notifier callbacks, zap shared pages only because private pages won't have a userspace mapping - For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute. - For guest_memfd invalidations, zap private only. Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Stable-dep-of: 9129633d568e ("KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05KVM: x86: switch hugepage recovery thread to vhost_taskPaolo Bonzini
commit d96c77bd4eeba469bddbbb14323d2191684da82a upstream. kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container. However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced: root@test ~# mkdir /sys/fs/cgroup/test root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs root@test ~# qemu-system-x86_64 -nographic -enable-kvm and in another terminal: root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze root@test ~# cat /sys/fs/cgroup/test/cgroup.events populated 1 frozen 0 The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely. Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only it inherits the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative. Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun). Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correcly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled RISCV: - RISCV: KVM: use raw_spinlock for critical section in imsic x86: - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window. - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds - Fix case in which an out-of-date cached value for the segments could by returned by KVM_GET_SREGS. - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests. Generic: - Remove a couple of unused functions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits) RISCV: KVM: use raw_spinlock for critical section in imsic KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups KVM: selftests: x86: Avoid using SSE/AVX instructions KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset() KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot x86/kvm: Override default caching mode for SEV-SNP and TDX KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic KVM: Remove unused kvm_vcpu_gfn_to_pfn KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS KVM: arm64: Fix shift-out-of-bounds bug KVM: arm64: Shave a few bytes from the EL2 idmap code KVM: arm64: Don't eagerly teardown the vgic on init error KVM: arm64: Expose S1PIE to guests KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule KVM: arm64: nv: Punt stage-2 recycling to a vCPU request KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed ...
2024-10-20KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomicDr. David Alan Gilbert
The last use of kvm_vcpu_gfn_to_pfn_atomic was removed by commit 1bbc60d0c7e5 ("KVM: x86/mmu: Remove MMU auditing") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-3-linux@treblig.org> [Adjust Documentation/virt/kvm/locking.rst. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: Remove unused kvm_vcpu_gfn_to_pfnDr. David Alan Gilbert
The last use of kvm_vcpu_gfn_to_pfn was removed by commit b1624f99aa8f ("KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-2-linux@treblig.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-14sched/fair: Fix external p->on_rq usersPeter Zijlstra
Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement delayed dequeue") KVM's preemption notifiers have started mis-classifying preemption vs blocking. Notably p->on_rq is no longer sufficient to determine if a task is runnable or blocked -- the aforementioned commit introduces tasks that remain on the runqueue even through they will not run again, and should be considered blocked for many cases. Add the task_is_runnable() helper to classify things and audit all external users of the p->on_rq state. Also add a few comments. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
2024-09-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-17Merge tag 'kvm-x86-generic-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVK generic changes for 6.12: - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase. - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs.
2024-09-17KVM: use follow_pfnmap APIPeter Xu
Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest work is done perfectly on the other side (host_pfn_mapping_level()). Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09KVM: Harden guest memory APIs against out-of-bounds accessesSean Christopherson
When reading or writing a guest page, WARN and bail if offset+len would result in a read to a different page so that KVM bugs are more likely to be detected, and so that any such bugs are less likely to escalate to an out-of-bounds access. E.g. if userspace isn't using guard pages and the target page is at the end of a memslot. Note, KVM already hardens itself in similar APIs, e.g. in the "cached" variants, it's just the vanilla APIs that are playing with fire. Link: https://lore.kernel.org/r/20240829191413.900740-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: Write the per-page "segment" when clearing (part of) a guest pageSean Christopherson
Pass "seg" instead of "len" when writing guest memory in kvm_clear_guest(), as "seg" holds the number of bytes to write for the current page, while "len" holds the total bytes remaining. Luckily, all users of kvm_clear_guest() are guaranteed to not cross a page boundary, and so the bug is unhittable in the current code base. Fixes: 2f5414423ef5 ("KVM: remove kvm_clear_guest_page") Reported-by: zyr_ms@outlook.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219104 Link: https://lore.kernel.org/r/20240829191413.900740-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-04KVM: Add arch hooks for enabling/disabling virtualizationSean Christopherson
Add arch hooks that are invoked when KVM enables/disable virtualization. x86 will use the hooks to register an "emergency disable" callback, which is essentially an x86-specific shutdown notifier that is used when the kernel is doing an emergency reboot/shutdown/kexec. Add comments for the declarations to help arch code understand exactly when the callbacks are invoked. Alternatively, the APIs themselves could communicate most of the same info, but kvm_arch_pre_enable_virtualization() and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make it a bit less obvious that they are intended to be implemented as a pair. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Add a module param to allow enabling virtualization when KVM is loadedSean Christopherson
Add an on-by-default module param, enable_virt_at_load, to let userspace force virtualization to be enabled in hardware when KVM is initialized, i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization during KVM initialization allows userspace to avoid the additional latency when creating/destroying the first/last VM (or more specifically, on the 0=>1 and 1=>0 edges of creation/destruction). Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency could be non-trivial as the cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could be problematic for use case that need to spin up VMs quickly. Prior to commit 10474ae8945c ("KVM: Activate Virtualization On Demand"), KVM _unconditionally_ enabled virtualization during load, i.e. there's no fundamental reason KVM needs to dynamically toggle virtualization. These days, the only known argument for not enabling virtualization is to allow KVM to be autoloaded without blocking other out-of-tree hypervisors, and such use cases can simply change the module param, e.g. via command line. Note, the aforementioned commit also mentioned that enabling SVM (AMD's virtualization extensions) can result in "using invalid TLB entries". It's not clear whether the changelog was referring to a KVM bug, a CPU bug, or something else entirely. Regardless, leaving virtualization off by default is not a robust "fix", as any protection provided is lost the instant userspace creates the first VM. Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Rename arch hooks related to per-CPU virtualization enablingSean Christopherson
Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Rename symbols related to enabling virtualization hardwareSean Christopherson
Rename the various functions (and a variable) that enable virtualization to prepare for upcoming changes, and to clean up artifacts of KVM's previous behavior, which required manually juggling locks around kvm_usage_count. Drop the "nolock" qualifier from per-CPU functions now that there are no "nolock" implementations of the "all" variants, i.e. now that calling a non-nolock function from a nolock function isn't confusing (unlike this sentence). Drop "all" from the outer helpers as they no longer manually iterate over all CPUs, and because it might not be obvious what "all" refers to. In lieu of the above qualifiers, append "_cpu" to the end of the functions that are per-CPU helpers for the outer APIs. Opportunistically prepend "kvm" to all functions to help make it clear that they are KVM helpers, but mostly because there's no reason not to. Lastly, use "virtualization" instead of "hardware", because while the functions do enable virtualization in hardware, there are a _lot_ of things that KVM enables in hardware. Defer renaming the arch hooks to future patches, purely to reduce the amount of churn in a single commit. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Register cpuhp and syscore callbacks when enabling hardwareSean Christopherson
Register KVM's cpuhp and syscore callback when enabling virtualization in hardware instead of registering the callbacks during initialization, and let the CPU up/down framework invoke the inner enable/disable functions. Registering the callbacks during initialization makes things more complex than they need to be, as KVM needs to be very careful about handling races between enabling CPUs being onlined/offlined and hardware being enabled/disabled. Intel TDX support will require KVM to enable virtualization during KVM initialization, i.e. will add another wrinkle to things, at which point sorting out the potential races with kvm_usage_count would become even more complex. Note, using the cpuhp framework has a subtle behavioral change: enabling will be done serially across all CPUs, whereas KVM currently sends an IPI to all CPUs in parallel. While serializing virtualization enabling could create undesirable latency, the issue is limited to the 0=>1 transition of VM creation. And even that can be mitigated, e.g. by letting userspace force virtualization to be enabled when KVM is initialized. Cc: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlockSean Christopherson
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock on x86 due to a chain of locks and SRCU synchronizations. Translating the below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the fairness of r/w semaphores). CPU0 CPU1 CPU2 1 lock(&kvm->slots_lock); 2 lock(&vcpu->mutex); 3 lock(&kvm->srcu); 4 lock(cpu_hotplug_lock); 5 lock(kvm_lock); 6 lock(&kvm->slots_lock); 7 lock(cpu_hotplug_lock); 8 sync(&kvm->srcu); Note, there are likely more potential deadlocks in KVM x86, e.g. the same pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with __kvmclock_cpufreq_notifier(): cpuhp_cpufreq_online() | -> cpufreq_online() | -> cpufreq_gov_performance_limits() | -> __cpufreq_driver_target() | -> __target_index() | -> cpufreq_freq_transition_begin() | -> cpufreq_notify_transition() | -> ... __kvmclock_cpufreq_notifier() But, actually triggering such deadlocks is beyond rare due to the combination of dependencies and timings involved. E.g. the cpufreq notifier is only used on older CPUs without a constant TSC, mucking with the NX hugepage mitigation while VMs are running is very uncommon, and doing so while also onlining/offlining a CPU (necessary to generate contention on cpu_hotplug_lock) would be even more unusual. The most robust solution to the general cpu_hotplug_lock issue is likely to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq notifier doesn't to take kvm_lock. For now, settle for fixing the most blatant deadlock, as switching to an RCU-protected list is a much more involved change, but add a comment in locking.rst to call out that care needs to be taken when walking holding kvm_lock and walking vm_list. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S O ------------------------------------------------------ tee/35048 is trying to acquire lock: ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm] but task is already holding lock: ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (kvm_lock){+.+.}-{3:3}: __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 kvm_dev_ioctl+0x4fb/0xe50 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2e/0xb0 static_key_slow_inc+0x16/0x30 kvm_lapic_set_base+0x6a/0x1c0 [kvm] kvm_set_apic_base+0x8f/0xe0 [kvm] kvm_set_msr_common+0x9ae/0xf80 [kvm] vmx_set_msr+0xa54/0xbe0 [kvm_intel] __kvm_set_msr+0xb6/0x1a0 [kvm] kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm] kvm_vcpu_ioctl+0x485/0x5b0 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&kvm->srcu){.+.+}-{0:0}: __synchronize_srcu+0x44/0x1a0 synchronize_srcu_expedited+0x21/0x30 kvm_swap_active_memslots+0x110/0x1c0 [kvm] kvm_set_memslot+0x360/0x620 [kvm] __kvm_set_memory_region+0x27b/0x300 [kvm] kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm] kvm_vm_ioctl+0x295/0x650 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&kvm->slots_lock){+.+.}-{3:3}: __lock_acquire+0x15ef/0x2e30 lock_acquire+0xe0/0x260 __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 set_nx_huge_pages+0x179/0x1e0 [kvm] param_attr_store+0x93/0x100 module_attr_store+0x22/0x40 sysfs_kf_write+0x81/0xb0 kernfs_fop_write_iter+0x133/0x1d0 vfs_write+0x28d/0x380 ksys_write+0x70/0xe0 __x64_sys_write+0x1f/0x30 x64_sys_call+0x281b/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: Chao Gao <chao.gao@intel.com> Fixes: 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)Sean Christopherson
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't directly emulate instructions for ES/SNP, and instead the guest must explicitly request emulation. Unless the guest explicitly requests emulation without accessing memory, ES/SNP relies on KVM creating an MMIO SPTE, with the subsequent #NPF being reflected into the guest as a #VC. But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs, because except for ES/SNP, doing so requires setting reserved bits in the SPTE, i.e. the SPTE can't be readable while also generating a #VC on writes. Because KVM never creates MMIO SPTEs and jumps directly to emulation, the guest never gets a #VC. And since KVM simply resumes the guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU into an infinite #NPF loop if the vCPU attempts to write read-only memory. Disallow read-only memory for all VMs with protected state, i.e. for upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible to support read-only memory, as TDX uses EPT Violation #VE to reflect the fault into the guest, e.g. KVM could configure read-only SPTEs with RX protections and SUPPRESS_VE=0. But there is no strong use case for supporting read-only memslots on TDX, e.g. the main historical usage is to emulate option ROMs, but TDX disallows executing from shared memory. And if someone comes along with a legitimate, strong use case, the restriction can always be lifted for TDX. Don't bother trying to retroactively apply the restriction to SEV-ES VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't possibly work for SEV-ES, i.e. disallowing such memslots is really just means reporting an error to userspace instead of silently hanging vCPUs. Trying to deal with the ordering between KVM_SEV_INIT and memslot creation isn't worth the marginal benefit it would provide userspace. Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES") Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") Cc: Peter Gonda <pgonda@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerly Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240809190319.1710470-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: extend kvm_range_has_memory_attributes() to check subset of attributesPaolo Bonzini
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()Paolo Bonzini
Use a guard to simplify early returns, and add two more easy shortcuts. If the requested attributes are invalid, the attributes xarray will never show them as set. And if testing a single page, kvm_get_memory_attributes() is more efficient. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-07-12KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memoryIsaku Yamahata
Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the memory range and calls the arch-specific function. The implementation is optional and enabled by a Kconfig symbol. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-28KVM: Add missing MODULE_DESCRIPTION()Jeff Johnson
Add a module description for kvm.ko to fix a 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm.o Opportunistically update kvm_main.c's comically stale file comment to match the module description. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com [sean: split x86 changes to a separate commit, remove stale VT-x comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-21Merge tag 'kvm-x86-fixes-6.10-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM fixes for 6.10 - Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest. - Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr from KVM's supported CPUID (if it's available). - Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic. - Fix technically benign bug in __kvm_handle_hva_range() where KVM consumes the return from a void-returning function as if it were a boolean.
2024-06-20kvm: do not account temporary allocations to kmemAlexey Dobriyan
Some allocations done by KVM are temporary, they are created as result of program actions, but can't exists for arbitrary long times. They should have been GFP_TEMPORARY (rip!). OTOH, kvm-nx-lpage-recovery and kvm-pit kernel threads exist for as long as VM exists but their task_struct memory is not accounted. This is story for another day. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Message-ID: <c0122f66-f428-417e-a360-b25fc0f154a0@p183> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-06-18KVM: Mark a vCPU as preempted/ready iff it's scheduled out while runningDavid Matlack
Mark a vCPU as preempted/ready if-and-only-if it's scheduled out while running. i.e. Do not mark a vCPU preempted/ready if it's scheduled out during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with immediate_exit. Commit 54aa83c90198 ("KVM: x86: do not set st->preempted when going back to user space") stopped marking a vCPU as preempted when returning to userspace, but if userspace then invokes a KVM vCPU ioctl() that gets preempted, the vCPU will be marked preempted/ready. This is arguably incorrect behavior since the vCPU was not actually preempted while the guest was running, it was preempted while doing something on behalf of userspace. Marking a vCPU preempted iff its running also avoids KVM dirtying guest memory after userspace has paused vCPUs, e.g. for live migration, which allows userspace to collect the final dirty bitmap before or in parallel with saving vCPU state, without having to worry about saving vCPU state triggering writes to guest memory. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-4-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: Ensure new code that references immediate_exit gets extra scrutinyDavid Matlack
Ensure that any new KVM code that references immediate_exit gets extra scrutiny by renaming it to immediate_exit__unsafe in kernel code. All fields in struct kvm_run are subject to TOCTOU races since they are mapped into userspace, which may be malicious or buggy. To protect KVM, introduces a new macro that appends __unsafe to select field names in struct kvm_run, hinting to developers and reviewers that accessing such fields must be done carefully. Apply the new macro to immediate_exit, since userspace can make immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing immediate_exit directly could lead to unexpected bugs in the future. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-3-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: Introduce vcpu->wants_to_runDavid Matlack
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: Reject overly excessive IDs in KVM_CREATE_VCPUMathias Krause
If, on a 64 bit system, a vCPU ID is provided that has the upper 32 bits set to a non-zero value, it may get accepted if the truncated to 32 bits integer value is below KVM_MAX_VCPU_IDS and 'max_vcpus'. This feels very wrong and triggered the reporting logic of PaX's SIZE_OVERFLOW plugin. Instead of silently truncating and accepting such values, pass the full value to kvm_vm_ioctl_create_vcpu() and make the existing limit checks return an error. Even if this is a userland ABI breaking change, no sane userland could have ever relied on that behaviour. Reported-by: PaX's SIZE_OVERFLOW plugin running on grsecurity's syzkaller Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface") Cc: Emese Revfy <re.emese@gmail.com> Cc: PaX Team <pageexec@freemail.hu> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-2-minipli@grsecurity.net [sean: tweak comment about INT_MAX assertion] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-18KVM: Stop processing *all* memslots when "null" mmu_notifier handler is foundBabu Moger
Bail from outer address space loop, not just the inner memslot loop, when a "null" handler is encountered by __kvm_handle_hva_range(), which is the intended behavior. On x86, which has multiple address spaces thanks to SMM emulation, breaking from just the memslot loop results in undefined behavior due to assigning the non-existent return value from kvm_null_fn() to a bool. In practice, the bug is benign as kvm_mmu_notifier_invalidate_range_end() is the only caller that passes handler=kvm_null_fn, and it doesn't set flush_on_ret, i.e. assigning garbage to r.ret is ultimately ignored. And for most configuration the compiler elides the entire sequence, i.e. there is no undefined behavior at runtime. ------------[ cut here ]------------ UBSAN: invalid-load in arch/x86/kvm/../../../virt/kvm/kvm_main.c:655:10 load of value 160 is not a valid value for type '_Bool' CPU: 370 PID: 8246 Comm: CPU 0/KVM Not tainted 6.8.2-amdsos-build58-ubuntu-22.04+ #1 Hardware name: AMD Corporation Sh54p/Sh54p, BIOS WPC4429N 04/25/2024 Call Trace: <TASK> dump_stack_lvl+0x48/0x60 ubsan_epilogue+0x5/0x30 __ubsan_handle_load_invalid_value+0x79/0x80 kvm_mmu_notifier_invalidate_range_end.cold+0x18/0x4f [kvm] __mmu_notifier_invalidate_range_end+0x63/0xe0 __split_huge_pmd+0x367/0xfc0 do_huge_pmd_wp_page+0x1cc/0x380 __handle_mm_fault+0x8ee/0xe50 handle_mm_fault+0xe4/0x4a0 __get_user_pages+0x190/0x840 get_user_pages_unlocked+0xe0/0x590 hva_to_pfn+0x114/0x550 [kvm] kvm_faultin_pfn+0xed/0x5b0 [kvm] kvm_tdp_page_fault+0x123/0x170 [kvm] kvm_mmu_page_fault+0x244/0xaa0 [kvm] vcpu_enter_guest+0x592/0x1070 [kvm] kvm_arch_vcpu_ioctl_run+0x145/0x8a0 [kvm] kvm_vcpu_ioctl+0x288/0x6d0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x77/0x120 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ---[ end trace ]--- Fixes: 071064f14d87 ("KVM: Don't take mmu_lock for range invalidation unless necessary") Signed-off-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/b8723d39903b64c241c50f5513f804390c7b5eec.1718203311.git.babu.moger@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-14KVM: Fix a goof where kvm_create_vm() returns 0 instead of -ENOMEMDan Carpenter
The error path for OOM when allocating buses used to return -ENOMEM using the local variable 'r', where 'r' was initialized at the top of the function. But a new "r = kvm_init_irq_routing(kvm);" was introduced in the middle of the function, so now the error code is not set and it eventually leads to a NULL dereference due to kvm_dev_ioctl_create_vm() thinking kvm_create_vm() succeeded. Set the error code back to -ENOMEM. Opportunistically tweak the logic to pre-set "r = -ENOMEM" immediately before the flows that can fail due to memory allocation failure to make it less likely that the bug recurs in the future. Fixes: fbe4a7e881d4 ("KVM: Setup empty IRQ routing when creating a VM") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/02051e0a-09d8-49a2-917f-7c2f278a1ba1@moroto.mountain [sean: tweak all of the "r = -ENOMEM" sites, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: Delete the now unused kvm_arch_sched_in()Sean Christopherson
Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: Add a flag to track if a loaded vCPU is scheduled outSean Christopherson
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of being scheduled out (vCPU put path), or if the vCPU is being reloaded after being scheduled out (vCPU load path). In the short term, this will allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out during kvm_arch_vcpu_load(). Longer term, scheduled_out opens up other potential optimizations, without creating subtle/brittle dependencies. E.g. it allows KVM to keep guest state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}() and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the exact ordering of calls into arch code. Adding scheduled_out also obviates the need for a kvm_arch_sched_out() hook, e.g. if arch code needs to do something novel when putting vCPU state. And even if KVM never uses scheduled_out for anything beyond dropping kvm_arch_sched_in(), just being able to remove all of the arch stubs makes it worth adding the flag. Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11KVM: Setup empty IRQ routing when creating a VMYi Wang
Setup empty IRQ routing during VM creation so that x86 and s390 don't need to set empty/dummy IRQ routing during KVM_CREATE_IRQCHIP (in future patches). Initializing IRQ routing before there are any potential readers allows KVM to avoid the synchronize_srcu() in kvm_set_irq_routing(), which can introduces 20+ milliseconds of latency in the VM creation path. Ensuring that all VMs have non-NULL IRQ routing also hardens KVM against misbehaving userspace VMMs, e.g. RISC-V dynamically instantiates its interrupt controller, but doesn't override kvm_arch_intc_initialized() or kvm_arch_irqfd_allowed(), and so can likely reach kvm_irq_map_gsi() without fully initialized IRQ routing. Signed-off-by: Yi Wang <foxywang@tencent.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240506101751.3145407-2-foxywang@tencent.com [sean: init refcount after IRQ routing, fix stub, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()Breno Leitao
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the loads and stores are atomic. In the extremely unlikely scenario the compiler tears the stores, it's theoretically possible for KVM to attempt to get a vCPU using an out-of-bounds index, e.g. if the write is split into multiple 8-bit stores, and is paired with a 32-bit load on a VM with 257 vCPUs: CPU0 CPU1 last_boosted_vcpu = 0xff; (last_boosted_vcpu = 0x100) last_boosted_vcpu[15:8] = 0x01; i = (last_boosted_vcpu = 0x1ff) last_boosted_vcpu[7:0] = 0x00; vcpu = kvm->vcpu_array[0x1ff]; As detected by KCSAN: BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm] write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4: kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890) __x64_sys_ioctl (fs/ioctl.c:890) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x00000012 -> 0x00000000 Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: Enable halt polling shrink parameter by defaultParshuram Sangle
Default halt_poll_ns_shrink value of 0 always resets polling interval to 0 on an un-successful poll where vcpu wakeup is not received. This is mostly to avoid pointless polling for more number of shorter intervals. But disabled shrink assumes vcpu wakeup is less likely to be received in subsequent shorter polling intervals. Another side effect of 0 shrink value is that, even on a successful poll if total block time was greater than current polling interval, the polling interval starts over from 0 instead of shrinking by a factor. Enabling shrink with value of 2 allows the polling interval to gradually decrement in case of un-successful poll events as well. This gives a fair chance for successful polling events in subsequent polling intervals rather than resetting it to 0 and starting over from grow_start. Below kvm stat log snippet shows interleaved growth and shrinking of polling interval: 87162647182125: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0) 87162647637763: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162649627943: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162650892407: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162651540378: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162652276768: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162652515037: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162653383787: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162653627670: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162653796321: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162656171645: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162661607487: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000) Having both grow and shrink enabled creates a balance in polling interval growth and shrink behavior. Tests show improved successful polling attempt ratio which contribute to VM performance. Power penalty is quite negligible as shrunk polling intervals create bursts of very short durations. Performance assessment results show 3-6% improvements in CPU+GPU, Memory and Storage Android VM workloads whereas 5-9% improvement in average FPS of gaming VM workloads. Power penalty is below 1% where host OS is either idle or running a native workload having 2 VMs enabled. CPU/GPU intensive gaming workloads as well do not show any increased power overhead with shrink enabled. Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Link: https://lore.kernel.org/r/20231102154628.2120-2-parshuram.sangle@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: Unexport kvm_debugfs_dirBorislav Petkov
After faf01aef0570 ("KVM: PPC: Merge powerpc's debugfs entry content into generic entry") kvm_debugfs_dir is not used anywhere else outside of kvm_main.c Unexport it and make it static. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240515150804.9354-1-bp@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-12Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.10: - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is unused by hardware, so that KVM can communicate its inability to map GPAs that set bits 51:48 due to lack of 5-level paging. Guest firmware is expected to use the information to safely remap BARs in the uppermost GPA space, i.e to avoid placing a BAR at a legal, but unmappable, GPA. - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Don't completely ignore same-value writes to immutable feature MSRs, as doing so results in KVM failing to reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled KVM-wide to avoid confusing debuggers (KVM will never bother clearing the ABSENT inhibit, even if userspace enables in-kernel local APIC).
2024-05-12Merge tag 'kvm-x86-generic-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM cleanups for 6.10: - Misc cleanups extracted from the "exit on missing userspace mapping" series, which has been put on hold in anticipation of a "KVM Userfault" approach, which should provide a superset of functionality. - Remove kvm_make_all_cpus_request_except(), which got added to hack around an AVIC bug, and then became dead code when a more robust fix came along. - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
2024-05-12Merge tag 'kvmarm-6.10-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.10 - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greattly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements.
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-05-05mm: pass VMA instead of MM to follow_pte()David Hildenbrand
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-02KVM: Remove kvm_make_all_cpus_request_except()Venkatesh Srinivas
Remove kvm_make_all_cpus_request_except() as it effectively has no users, and arguably should never have been added in the first place. Commit 54163a346d4a ("KVM: Introduce kvm_make_all_cpus_request_except()") added the "except" variation for use in SVM's AVIC update path, which used it to skip sending a request to the current vCPU (commit 7d611233b016 ("KVM: SVM: Disable AVIC before setting V_IRQ")). But the AVIC usage of kvm_make_all_cpus_request_except() was essentially a hack-a-fix that simply squashed the most likely scenario of a racy WARN without addressing the underlying problem(s). Commit f1577ab21442 ("KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated") eventually fixed the WARN itself, and the "except" usage was subsequently dropped by df63202fe52b ("KVM: x86: APICv: drop immediate APICv disablement on current vCPU"). That kvm_make_all_cpus_request_except() hasn't gained any users in the last ~3 years isn't a coincidence. If a VM-wide broadcast *needs* to skip the current vCPU, then odds are very good that there is underlying bug that could be better fixed elsewhere. Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org> Link: https://lore.kernel.org/r/20240404232651.1645176-1-venkateshs@chromium.org [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-25KVM: Treat the device list as an rculistOliver Upton
A subsequent change to KVM/arm64 will necessitate walking the device list outside of the kvm->lock. Prepare by converting to an rculist. This has zero effect on the VM destruction path, as it is expected every reader is backed by a reference on the kvm struct. On the other hand, ensure a given device is completely destroyed before dropping the kvm->lock in the release() path, as certain devices expect to be a singleton (e.g. the vfio-kvm device). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240422200158.2606761-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-04-19KVM: Allow page-sized MMU caches to be initialized with custom 64-bit valuesSean Christopherson
Add support to MMU caches for initializing a page with a custom 64-bit value, e.g. to pre-fill an entire page table with non-zero PTE values. The functionality will be used by x86 to support Intel's TDX, which needs to set bit 63 in all non-present PTEs in order to prevent !PRESENT page faults from getting reflected into the guest (Intel's EPT Violation #VE architecture made the less than brilliant decision of having the per-PTE behavior be opt-out instead of opt-in). Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5919f685f109a1b0ebc6bd8fc4536ee94bcc172d.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()Sean Christopherson
Remove gfn_to_pfn_cache_invalidate_start()'s unused @may_block parameter, which was leftover from KVM's abandoned (for now) attempt to support guest usage of gfn_to_pfn caches. Fixes: a4bff3df5147 ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage") Reported-by: Like Xu <like.xu.linux@gmail.com> Cc: Paul Durrant <paul@xen.org> Cc: David Woodhouse <dwmw2@infradead.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240305003742.245767-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>