path: root/virt
44 hours  Merge tag 'kvm-x86-no_assignment-6.17' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM VFIO device assignment cleanups for 6.17: Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now that KVM no longer uses assigned_device_count as a bad heuristic for "VM has an irqbypass producer" or for "VM has access to host MMIO".
44 hours  Merge tag 'kvm-x86-dirty_ring-6.17' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM Dirty Ring changes for 6.17: Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely. Clean up code and comments along the way.
44 hours  Merge tag 'kvm-x86-generic-6.17' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM generic changes for 6.17:
- Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions.
- Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL.
44 hours  Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM IRQ changes for 6.17:
- Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI.
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand.
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time.
- Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
- Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation.
- Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry.
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs.
- Dedup x86's device posted IRQ code, as the vast majority of functionality can be shared verbatim between SVM and VMX.
- Harden the device posted IRQ code against bugs and runtime errors.
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n).
- Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU.
- Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs.
- Clean up and document/comment the irqfd assignment code.
- Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-06-25  KVM: guest_memfd: Remove redundant kvm_gmem_getattr implementation  (Shivank Garg)
Remove the redundant kvm_gmem_getattr() implementation that simply calls generic_fillattr() without any special handling. The VFS layer (vfs_getattr_nosec()) will call generic_fillattr() by default when no custom getattr operation is provided in the inode_operations structure. This is a cleanup with no functional change. Signed-off-by: Shivank Garg <shivankg@amd.com> Reviewed-By: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250602172317.10601-1-shivankg@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25  VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()  (Sean Christopherson)
Drop kvm_arch_{start,end}_assignment() and all associated code now that KVM x86 no longer consumes assigned_device_count. Tracking whether or not a VFIO-assigned device is formally associated with a VM is fundamentally flawed, as such an association is optional for general usage, i.e. is prone to false negatives. E.g. prior to commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), device passthrough via VFIO would fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM binding. And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT, shouldn't be relying on generic x86 tracking infrastructure. Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24  KVM: Allow CPU to reschedule while setting per-page memory attributes  (Liam Merwick)
When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+), the host can experience CPU soft lockups when running an operation in kvm_vm_set_mem_attributes() to set memory attributes on the whole range of guest memory. watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372] CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary) Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024 RIP: 0010:xas_create+0x78/0x1f0 Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87 RSP: 0018:ffffad890a34b940 EFLAGS: 00000286 RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000 RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868 R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868 FS: 00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0 PKRU: 55555554 Call Trace: <TASK> xas_store+0x58/0x630 __xa_store+0xa5/0x130 xa_store+0x2c/0x50 kvm_vm_set_mem_attributes+0x343/0x710 [kvm] kvm_vm_ioctl+0x796/0xab0 [kvm] __x64_sys_ioctl+0xa3/0xd0 do_syscall_64+0x8c/0x7a0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f5578d031bb Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000 R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120 R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0 While looping through the range of memory setting the attributes, call cond_resched() to give the scheduler a chance to run a higher priority task on the runqueue if necessary and avoid staying in kernel mode long enough to trigger the lockup. Fixes: 5a475554db1e ("KVM: Introduce per-page memory attributes") Cc: stable@vger.kernel.org # 6.12.x Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
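For illustration, a minimal sketch of the shape of the fix, assuming the simplified loop below (locking, rollback, and the exact kvm_main.c structure are omitted or abbreviated):

  for (i = start; i < end; i++) {
      r = xa_err(xa_store(&kvm->mem_attr_array, i,
                          xa_mk_value(attributes), GFP_KERNEL_ACCOUNT));
      if (r)
          break;

      /* Yield between per-page xarray updates so a 1TB+ attribute
       * conversion can't hog the CPU long enough to trip the
       * soft-lockup watchdog. */
      cond_resched();
  }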
2025-06-23  KVM: Drop sanity check that per-VM list of irqfds is unique  (Sean Christopherson)
Now that the eventfd's waitqueue ensures it has at most one priority waiter, i.e. prevents KVM from binding multiple irqfds to one eventfd, drop KVM's sanity check that eventfds are unique for a single VM. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter  (Sean Christopherson)
Disallow binding an irqfd to an eventfd that already has a priority waiter, i.e. to an eventfd that already has an attached irqfd. KVM always operates in exclusive mode for EPOLL_IN (unconditionally returns '1'), i.e. only the first waiter will be notified. KVM already disallows binding multiple irqfds to an eventfd in a single VM, but doesn't guard against multiple VMs binding to an eventfd. Adding the extra protection reduces the pain of a userspace VMM bug, e.g. if userspace fails to de-assign before re-assigning when transferring state for intra-host migration, then the migration will explicitly fail as opposed to dropping IRQs on the destination VM. Temporarily keep KVM's manual check on irqfds.items, but add a WARN, e.g. to allow sanity checking the waitqueue enforcement. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: David Matlack <dmatlack@google.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()  (Sean Christopherson)
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and instead have callers manually add the flag prior to adding their structure to the queue. Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature of exclusive, priority waiters means that only the first waiter added will ever receive notifications. Pushing the flawed behavior to callers will allow fixing the problem one hypervisor at a time (KVM added the flawed API, and then KVM's code was copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for adding an API that provides true exclusivity, i.e. that guarantees at most one priority waiter is in the queue. Opportunistically add a comment in Hyper-V to call out the mess. Xen privcmd's irqfd_wakeup() doesn't actually operate in exclusive mode, i.e. can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE. And KVM is primed to switch to the aforementioned fully exclusive API, i.e. won't be carrying the flawed code for long. No functional change intended. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
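For illustration, the caller-side pattern after the change, sketched from KVM's irqfd registration (simplified; the surrounding registration code is omitted):

  /* The caller now requests exclusivity explicitly instead of having
   * add_wait_queue_priority() force the flag on its behalf. */
  init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
  irqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
  add_wait_queue_priority(wqh, &irqfd->wait);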
2025-06-23  KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock  (Sean Christopherson)
Add an irqfd to its target eventfd's waitqueue while holding irqfds.lock, which is mildly terrifying but functionally safe. irqfds.lock is taken inside the waitqueue's lock, but if and only if the eventfd is being released, i.e. that path is mutually exclusive with registration as KVM holds a reference to the eventfd (and obviously must do so to avoid UAF). This will allow using the eventfd's waitqueue to enforce KVM's requirement that eventfd is assigned to at most one irqfd, without introducing races. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Add irqfd to KVM's list via the vfs_poll() callback  (Sean Christopherson)
Add the irqfd structure to KVM's list of irqfds in kvm_irqfd_register(), i.e. via the vfs_poll() callback. This will allow taking irqfds.lock across the entire registration sequence (add to waitqueue, add to list), and more importantly will allow inserting into KVM's list if and only if adding to the waitqueue succeeds (spoiler alert), without needing to juggle return codes in weird ways. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Initialize irqfd waitqueue callback when adding to the queue  (Sean Christopherson)
Initialize the irqfd waitqueue callback immediately prior to inserting the irqfd into the eventfd's waitqueue. Pre-initializing the state in a completely different context is all kinds of confusing, and incorrectly suggests that the waitqueue function needs to be initialized prior to vfs_poll(). Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Acquire SRCU lock outside of irqfds.lock during assignment  (Sean Christopherson)
Acquire SRCU outside of irqfds.lock so that the locking is symmetrical, and add a comment explaining why on earth KVM holds SRCU for so long. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Use a local struct to do the initial vfs_poll() on an irqfd  (Sean Christopherson)
Use a function-local struct for the poll_table passed to vfs_poll(), as nothing in the vfs_poll() callchain grabs a long-term reference to the structure, i.e. its lifetime doesn't need to be tied to the irqfd. Using a local structure will also allow propagating failures out of the polling callback without further polluting kvm_kernel_irqfd. Opportunistically rename irqfd_ptable_queue_proc() to kvm_irqfd_register() to capture what it actually does. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
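A rough sketch of the resulting pattern (struct and variable names are illustrative, and error propagation is omitted):

  struct kvm_irqfd_pt {
      struct kvm_kernel_irqfd *irqfd;
      poll_table pt;
  };

  static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
                                 poll_table *pt)
  {
      struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);

      /* Queue the irqfd on the eventfd's waitqueue; later patches also
       * add it to KVM's per-VM list from this callback. */
      add_wait_queue_priority(wqh, &p->irqfd->wait);
  }

  /* In the assignment path, the poll_table can live on the stack because
   * nothing in vfs_poll() holds a long-term reference to it. */
  struct kvm_irqfd_pt irqfd_pt = { .irqfd = irqfd };
  __poll_t events;

  init_poll_funcptr(&irqfd_pt.pt, kvm_irqfd_register);
  events = vfs_poll(file, &irqfd_pt.pt);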
2025-06-23  KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()  (Sean Christopherson)
Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing(). Calling arch code to know whether or not to call arch code is absurd. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23  KVM: Don't WARN if updating IRQ bypass route fails  (Sean Christopherson)
Don't bother WARNing if updating an IRTE route fails now that vendor code provides much more precise WARNs. The generic WARN doesn't provide enough information to actually debug the problem, and has obviously done nothing to surface the myriad bugs in KVM x86's implementation. Drop all of the associated return code plumbing that existed just so that common KVM could WARN. Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: fix typo in kvm_vm_set_mem_attributes() comment  (Liam Merwick)
It should be 'has' in the sentence and not 'as'. Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250609091121.2497429-4-liam.merwick@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Add trace_kvm_vm_set_mem_attributes()  (Liam Merwick)
Add a tracing function that, for a guest memory range, displays the start and end addresses plus the per-page attributes being set. Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250609091121.2497429-3-liam.merwick@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Pass new routing entries and irqfd when updating IRTEs  (Sean Christopherson)
When updating IRTEs in response to a GSI routing or IRQ bypass change, pass the new/current routing information along with the associated irqfd. This will allow KVM x86 to harden, simplify, and deduplicate its code. Since adding/removing a bypass producer is now conveniently protected with irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(), use the routing information cached in the irqfd instead of looking up the information in the current GSI routing tables. Opportunistically convert an existing printk() to pr_info() and put its string onto a single line (old code that strictly adhered to 80 chars). Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()  (Sean Christopherson)
Trigger the I/O APIC route rescan that's performed for a split IRQ chip after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e. before dropping kvm->irq_lock. Calling kvm_make_all_cpus_request() under a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair in __kvm_make_request()+kvm_check_request() ensures the new routing is visible to vCPUs prior to the request being visible to vCPUs. In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference") somewhat arbitrarily made the request outside of irq_lock to avoid holding irq_lock any longer than is strictly necessary. And then commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming split") took the easy route of adding another arch hook instead of risking a functional change. Note, the call to synchronize_srcu_expedited() does NOT provide ordering guarantees with respect to vCPUs scanning the new routing; as above, the request infrastructure provides the necessary ordering. I.e. there's no need to wait for kvm_scan_ioapic_routes() to complete if it's actively running, because regardless of whether it grabs the old or new table, the vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again and see the new mappings. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Require producers to pass in Linux IRQ number during registration  (Sean Christopherson)
Pass in the Linux IRQ associated with an IRQ bypass producer instead of relying on the caller to set the field prior to registration, as there's no benefit to relying on callers to do the right thing. Take care to set producer->irq before __connect(), as KVM expects the IRQ to be valid as soon as a connection is possible. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Use xarray to track producers and consumers  (Sean Christopherson)
Track IRQ bypass producers and consumers using an xarray to avoid the O(2n) insertion time associated with walking a list to check for duplicate entries, and to search for a partner. At low (tens or few hundreds) total producer/consumer counts, using a list is faster due to the need to allocate backing storage for the xarray. But as the count creeps into the thousands, the xarray wins easily, and can provide several orders of magnitude better latency at high counts, e.g. hundreds of nanoseconds vs. hundreds of milliseconds. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: David Matlack <dmatlack@google.com> Cc: Like Xu <like.xu.linux@gmail.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379 Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
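For illustration, one plausible shape of the duplicate-detection half, assuming an xarray keyed by the token pointer and a module-wide "lock" mutex (names are illustrative, not the exact irqbypass.c code; pairing with a waiting consumer is omitted):

  static DEFINE_XARRAY(producers);

  int irq_bypass_register_producer(struct irq_bypass_producer *producer)
  {
      unsigned long index = (unsigned long)producer->eventfd;

      guard(mutex)(&lock);

      /* xa_insert() fails with -EBUSY if an entry already exists at this
       * index, replacing the old "walk the entire list looking for a
       * duplicate token" pass. */
      return xa_insert(&producers, index, producer, GFP_KERNEL);
  }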
2025-06-20  irqbypass: Use guard(mutex) in lieu of manual lock+unlock  (Sean Christopherson)
Use guard(mutex) to clean up irqbypass's error handling. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
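For reference, the general shape of the cleanup (illustrative before/after; register_thing() and lock are stand-ins, not real irqbypass symbols):

  /* Before: every early-return path must remember to unlock. */
  mutex_lock(&lock);
  ret = register_thing(thing);
  if (ret) {
      mutex_unlock(&lock);
      return ret;
  }
  mutex_unlock(&lock);
  return 0;

  /* After: the scope-based guard releases the mutex automatically when
   * the enclosing scope is left, so error paths are plain returns. */
  guard(mutex)(&lock);
  return register_thing(thing);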
2025-06-20  irqbypass: Use paired consumer/producer to disconnect during unregister  (Sean Christopherson)
Use the paired consumer/producer information to disconnect IRQ bypass producers/consumers in O(1) time (ignoring the cost of __disconnect()). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Explicitly track producer and consumer bindings  (Sean Christopherson)
Explicitly track IRQ bypass producer:consumer bindings. This will allow making removal an O(1) operation; searching through the list to find information that is trivially tracked (and useful for debug) is wasteful. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Take ownership of producer/consumer token tracking  (Sean Christopherson)
Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Drop superfluous might_sleep() annotations  (Sean Christopherson)
Drop the superfluous might_sleep() annotations from irqbypass; mutex_lock() already provides all of the necessary tracking. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  irqbypass: Drop pointless and misleading THIS_MODULE get/put  (Sean Christopherson)
Drop irqbypass.ko's superfluous and misleading get/put calls on THIS_MODULE. A module taking a reference to itself is useless; no amount of checks will prevent doom and destruction if the caller hasn't already guaranteed the liveliness of the module (this goes for any module). E.g. if try_module_get() fails because irqbypass.ko is being unloaded, then the kernel has already hit a use-after-free by virtue of executing code whose lifecycle is tied to irqbypass.ko. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Assert that slots_lock is held when resetting per-vCPU dirty rings  (Sean Christopherson)
Assert that slots_lock is held in kvm_dirty_ring_reset() and add a comment to explain _why_ slots_lock needs to be held for the duration of the reset. Link: https://lore.kernel.org/all/aCSns6Q5oTkdXUEe@google.com Suggested-by: James Houghton <jthoughton@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
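The assertion itself is a one-liner; a minimal sketch (function signature simplified, and the comment here paraphrases the rationale rather than quoting the actual commit):

  static int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
  {
      /* slots_lock keeps the memslots stable for the entire reset, as
       * resetting harvested GFNs requires looking up their memslots. */
      lockdep_assert_held(&kvm->slots_lock);

      /* ... walk and reset harvested entries ... */
      return 0;
  }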
2025-06-20  KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets  (Sean Christopherson)
Use "mask" instead of a dedicated boolean to track whether or not there is at least one to-be-reset entry for the current slot+offset. In the body of the loop, mask is zero only on the first iteration, i.e. !mask is equivalent to first_round. Opportunistically combine the adjacent "if (mask)" statements into a single if-statement. No functional change intended. Cc: Peter Xu <peterx@redhat.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Check for empty mask of harvested dirty ring entries in caller  (Sean Christopherson)
When resetting a dirty ring, explicitly check that there is work to be done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries are found and/or on the loop's first iteration, and delete the extremely misleading comment "This is only needed to make compilers happy". KVM absolutely relies on mask to be zero-initialized, i.e. the comment is an outright lie. Furthermore, the compiler is right to complain that KVM is calling a function with uninitialized data, as there are no guarantees the implementation details of kvm_reset_dirty_gfn() will be visible to kvm_dirty_ring_reset(). While the flaw could be fixed by simply deleting (or rewording) the comment, and duplicating the check is unfortunate, checking mask in the caller will allow for additional cleanups. Opportunistically drop the zero-initialization of cur_slot and cur_offset. If a bug were introduced where either the slot or offset was consumed before mask is set to a non-zero value, then it is highly desirable for the compiler (or some other sanitizer) to yell. Cc: Peter Xu <peterx@redhat.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Conditionally reschedule when resetting the dirty ring  (Sean Christopherson)
When resetting a dirty ring, conditionally reschedule on each iteration after the first. The recently introduced hard limit mitigates the issue of an endless reset, but isn't sufficient to completely prevent RCU stalls, soft lockups, etc., nor is the hard limit intended to guard against such badness. Note! Take care to check for reschedule even in the "continue" paths, as a pathological scenario (or malicious userspace) could dirty the same gfn over and over, i.e. always hit the continue path. rcu: INFO: rcu_sched self-detected stall on CPU rcu: 4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563 rcu: (t=5250 jiffies g=-319 q=608 ncpus=24) CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G L 6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814 Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm] Call Trace: <TASK> kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm] kvm_dirty_ring_reset+0x58/0x220 [kvm] kvm_vm_ioctl+0x10eb/0x15d0 [kvm] __x64_sys_ioctl+0x8b/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Tainted: [L]=SOFTLOCKUP Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm] Call Trace: <TASK> kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm] kvm_dirty_ring_reset+0x58/0x220 [kvm] kvm_vm_ioctl+0x10eb/0x15d0 [kvm] __x64_sys_ioctl+0x8b/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: Bail from the dirty ring reset flow if a signal is pending  (Sean Christopherson)
Abort a dirty ring reset if the current task has a pending signal, as the hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal in a timely fashion. Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
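The check added to the reset loop looks roughly like this (simplified; entries_remain_in_ring() is a stand-in for the real loop condition):

  while (entries_remain_in_ring(ring)) {
      /* The INT_MAX cap alone doesn't guarantee a timely return to
       * userspace, so also bail if the task has a pending signal. */
      if (signal_pending(current))
          return -EINTR;

      /* ... harvest and reset the next entry ... */
  }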
2025-06-20  KVM: Bound the number of dirty ring entries in a single reset at INT_MAX  (Sean Christopherson)
Cap the number of ring entries that are reset in a single ioctl to INT_MAX to ensure userspace isn't confused by a wrap into negative space, and so that, in a truly pathological scenario, KVM doesn't miss a TLB flush due to the count wrapping to zero. While the size of the ring is fixed at 0x10000 entries and KVM (currently) supports at most 4096, userspace is allowed to harvest entries from the ring while the reset is in-progress, i.e. it's possible for the ring to always have harvested entries. Opportunistically return an actual error code from the helper so that a future fix to handle pending signals can gracefully return -EINTR. Drop the function comment now that the return code is a standard 0/-errno (and because a future commit will add a proper lockdep assertion). Opportunistically drop a similarly stale comment for kvm_dirty_ring_push(). Cc: Peter Xu <peterx@redhat.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking") Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250516213540.2546077-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-28  Merge branch 'kvm-lockdep-common' into HEAD  (Paolo Bonzini)
Introduce new mutex locking functions mutex_trylock_nest_lock() and mutex_lock_killable_nest_lock() and use them to clean up locking of all vCPUs for a VM. For x86, this removes some complex code that was used instead of lockdep's "nest_lock" feature. For ARM and RISC-V, this removes a lockdep warning when the VM is configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair amount of duplicate code by sharing the logic across all architectures. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27  KVM: add kvm_lock_all_vcpus and kvm_trylock_all_vcpus  (Maxim Levitsky)
In a few cases, usually in the initialization code, KVM locks all vCPUs of a VM to ensure that userspace doesn't do funny things while KVM performs an operation that affects the whole VM. Until now, all these operations were implemented using custom code, and all of them share the same problem: lockdep can't cope with simultaneous locking of a large number of locks of the same class. However, if these locks are taken while another lock is already held, which is luckily the case, it is possible to take advantage of lockdep's little-known _nest_lock feature, which in this case allows an unlimited number of locks of the same class to be taken. To implement this, create two functions: kvm_lock_all_vcpus() and kvm_trylock_all_vcpus(). Both functions are needed because some of the code that will be replaced in subsequent patches uses mutex_trylock() instead of regular mutex_lock(). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
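A condensed sketch of the locking helper, built on the mutex_lock_killable_nest_lock() primitive introduced earlier in this series (error unwinding shortened; not the verbatim upstream code):

  int kvm_lock_all_vcpus(struct kvm *kvm)
  {
      struct kvm_vcpu *vcpu;
      unsigned long i, j;
      int r;

      lockdep_assert_held(&kvm->lock);

      kvm_for_each_vcpu(i, vcpu, kvm) {
          /* kvm->lock is the "nest lock": lockdep folds all vCPU
           * mutexes taken under it into a single entry, so VMs with
           * more than MAX_LOCK_DEPTH vCPUs no longer splat. */
          r = mutex_lock_killable_nest_lock(&vcpu->mutex, &kvm->lock);
          if (r)
              goto out_unlock;
      }
      return 0;

  out_unlock:
      kvm_for_each_vcpu(j, vcpu, kvm) {
          if (j == i)
              break;
          mutex_unlock(&vcpu->mutex);
      }
      return r;
  }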
2025-05-27  Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging is enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
2025-05-08  KVM: Remove obsolete comment about locking for kvm_io_bus_read/write  (Li RongQing)
Nobody actually calls these functions with slots_lock held, and the srcu_dereference() in kvm_io_bus_read/write() precisely communicates both what is being protected and what provides the protection, so the comments are no longer needed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20250506012251.2613-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
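For reference, the pattern the changelog points at looks roughly like this in kvm_io_bus_read()/kvm_io_bus_write():

  struct kvm_io_bus *bus;

  /* srcu_dereference() documents both what is protected (the bus
   * pointer) and what protects it (kvm->srcu), which is what made the
   * old "must hold slots_lock" comment redundant. */
  bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
  if (!bus)
      return -ENOMEM;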
2025-04-24  KVM: SVM: Fix SNP AP destroy race with VMRUN  (Tom Lendacky)
An AP destroy request for a target vCPU is typically followed by an RMPADJUST to remove the VMSA attribute from the page currently being used as the VMSA for the target vCPU. This can result in a vCPU that is about to VMRUN to exit with #VMEXIT_INVALID. This usually does not happen as APs are typically sitting in HLT when being destroyed and therefore the vCPU thread is not running at the time. However, if HLT is allowed inside the VM, then the vCPU could be about to VMRUN when the VMSA attribute is removed from the VMSA page, resulting in a #VMEXIT_INVALID when the vCPU actually issues the VMRUN and causing the guest to crash. An RMPADJUST against an in-use (already running) VMSA results in a #NPF for the vCPU issuing the RMPADJUST, so the VMSA attribute cannot be changed until the VMRUN for target vCPU exits. The Qemu command line option '-overcommit cpu-pm=on' is an example of allowing HLT inside the guest. Update the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event to include the KVM_REQUEST_WAIT flag. The kvm_vcpu_kick() function will not wait for requests to be honored, so create kvm_make_request_and_kick() that will add a new event request and honor the KVM_REQUEST_WAIT flag. This will ensure that the target vCPU sees the AP destroy request before returning to the initiating vCPU should the target vCPU be in guest mode. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/fe2c885bf35643dd224e91294edb6777d5df23a4.1743097196.git.thomas.lendacky@amd.com [sean: add a comment explaining the use of smp_send_reschedule()] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
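Caller-side, the fix boils down to using the new kicking-and-waiting helper (illustrative usage; target_vcpu stands in for the vCPU being destroyed):

  /* KVM_REQ_UPDATE_PROTECTED_GUEST_STATE now carries KVM_REQUEST_WAIT,
   * and the helper honors it, i.e. doesn't return until the target vCPU
   * has exited guest mode and can observe the request. */
  kvm_make_request_and_kick(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);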
2025-04-07  Merge branch 'kvm-tdx-initial' into HEAD  (Paolo Bonzini)
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode").
- Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07  Merge branch 'kvm-6.15-rc2-fixes' into HEAD  (Paolo Bonzini)
2025-04-04  KVM: Allow building irqbypass.ko as a module when kvm.ko is a module  (Sean Christopherson)
Convert HAVE_KVM_IRQ_BYPASS into a tristate so that selecting IRQ_BYPASS_MANAGER follows KVM={m,y}, i.e. doesn't force irqbypass.ko to be built-in. Note, PPC allows building KVM as a module, but selects HAVE_KVM_IRQ_BYPASS from a boolean Kconfig, i.e. KVM PPC unnecessarily forces irqbypass.ko to be built-in. But that flaw is a longstanding PPC specific issue. Fixes: 61df71ee992d ("kvm: move "select IRQ_BYPASS_MANAGER" to common code") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250315024623.2363994-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-25  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm updates from Paolo Bonzini: "ARM: - Nested virtualization support for VGICv3, giving the nested hypervisor control of the VGIC hardware when running an L2 VM - Removal of 'late' nested virtualization feature register masking, making the supported feature set directly visible to userspace - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers - Paravirtual interface for discovering the set of CPU implementations where a VM may run, addressing a longstanding issue of guest CPU errata awareness in big-little systems and cross-implementation VM migration - Userspace control of the registers responsible for identifying a particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1), allowing VMs to be migrated cross-implementation - pKVM updates, including support for tracking stage-2 page table allocations in the protected hypervisor in the 'SecPageTable' stat - Fixes to vPMU, ensuring that userspace updates to the vPMU after KVM_RUN are reflected into the backing perf events LoongArch: - Remove unnecessary header include path - Assume constant PGD during VM context switch - Add perf events support for guest VM RISC-V: - Disable the kernel perf counter during configure - KVM selftests improvements for PMU - Fix warning at the time of KVM module removal x86: - Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs. This includes an implementation of per-rmap-entry locking; aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas locking an rmap for write requires taking both the per-rmap spinlock and the mmu_lock. Note that this decreases slightly the accuracy of accessed-page information, because changes to the SPTE outside aging might not use atomic operations even if they could race against a clear of the Accessed bit. This is deliberate because KVM and mm/ tolerate false positives/negatives for accessed information, and testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2) - Drop "support" for async page faults for protected guests that do not set SEND_ALWAYS (i.e. that only want async page faults at CPL3) - Bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. Particularly, destroy vCPUs before the MMU, despite the latter being a VM-wide operation - Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. 
It does not make sense to use the capability if the relevant registers are not available for reading or writing - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to fix a largely theoretical deadlock - Use the vCPU's actual Xen PV clock information when starting the Xen timer, as the cached state in arch.hv_clock can be stale/bogus - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only accounts for kvmclock, and there's no evidence that the flag is actually supported by Xen guests - Clean up the per-vCPU "cache" of its reference pvclock, and instead only track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately expensive to compute, and rarely changes for modern setups) - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates - Restrict the Xen hypercall MSR index to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range) - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks; there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of the TSC leaves should be rare, i.e. are not a hot path x86 (Intel): - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1 - Pass XFD_ERR as the payload when injecting #NM, as a preparatory step for upcoming FRED virtualization support - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits, both as a general cleanup and in anticipation of adding support for emulating Mode-Based Execution Control (MBEC) - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits) x86 (AMD): - Ensure the PSP driver is initialized when both the PSP and KVM modules are built-in (the initcall framework doesn't handle dependencies) - Use long-term pins when registering encrypted memory regions, so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to excessive fragmentation - Add macros and helpers for setting GHCB return/error codes - Add support for Idle HLT interception, which elides interception if the vCPU has a pending, unmasked virtual IRQ when HLT is executed - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical address - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g. because the vCPU was "destroyed" via SNP's AP Creation hypercall - Reject SNP AP Creation if the requested SEV features for the vCPU don't match the VM's configured set of features Selftests: - Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data instead of executing code. 
The theory is that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that an event is counting correctly without actually knowing what the event counts on the underlying hardware - Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and improve its coverage by collecting all dirty entries on each iteration - Fix a few minor bugs related to handling of stats FDs - Add infrastructure to make vCPU and VM stats FDs available to tests by default (open the FDs during VM/vCPU creation) - Relax an assertion on the number of HLT exits in the xAPIC IPI test when running on a CPU that supports AMD's Idle HLT (which elides interception of HLT if a virtual IRQ is pending and unmasked)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits) RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed RISC-V: KVM: Teardown riscv specific bits after kvm_exit LoongArch: KVM: Register perf callbacks for guest LoongArch: KVM: Implement arch-specific functions for guest perf LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel() LoongArch: KVM: Remove PGD saving during VM context switch LoongArch: KVM: Remove unnecessary header include path KVM: arm64: Tear down vGIC on failed vCPU creation KVM: arm64: PMU: Reload when resetting KVM: arm64: PMU: Reload when user modifies registers KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs KVM: arm64: PMU: Assume PMU presence in pmu-emul.c KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu KVM: arm64: Factor out pKVM hyp vcpu creation to separate function KVM: arm64: Initialize HCRX_EL2 traps in pKVM KVM: arm64: Factor out setting HCRX_EL2 traps into separate function KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected KVM: x86: Add infrastructure for secure TSC KVM: x86: Push down setting vcpu.arch.user_set_tsc ...
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-20  Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD  (Paolo Bonzini)
The immediate issue being fixed here is a nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). However, checking for a pending interrupt accesses the legacy PIC, and x86's kvm_arch_destroy_vm() currently frees the PIC before destroying vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results in a NULL pointer deref; fixing that teardown ordering is a prerequisite for the nVMX fix. The remaining patches attempt to bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. E.g. KVM currently unloads each vCPU's MMUs in a separate operation from destroying vCPUs, all because when guest SMP support was added, KVM had a kludgy MMU teardown flow that broke when a VM had more than one vCPU. And that oddity lived on, for 18 years... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: TDX: Handle TDX PV MMIO hypercall  (Sean Christopherson)
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according to TDX Guest Host Communication Interface (GHCI) spec. For TDX guests, VMM is not allowed to access vCPU registers and the private memory, and the code instructions must be fetched from the private memory. So MMIO emulation implemented for non-TDX VMs is not possible for TDX guests. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. The #VE handling is supposed to emulate the MMIO instruction inside the guest and convert it into a TDVMCALL with the leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION. The requested MMIO address must be in shared GPA space. The shared bit is stripped after check because the existing code for MMIO emulation is not aware of the shared bit. The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA should be shared. KVM could do the checks before exiting to userspace, however, even if KVM does the check, there still will be race conditions between the check in KVM and the emulation of MMIO access in userspace due to a memslot hotplug, or a memory attribute conversion. If userspace doesn't check the attribute of the GPA and the attribute happens to be private, it will not pose a security risk or cause an MCE, but it can lead to another issue. E.g., in QEMU, treating a GPA with private attribute as shared when it falls within RAM's range can result in extra memory consumption during the emulation to the access to the HVA of the GPA. There are two options: 1) Do the check both in KVM and userspace. 2) Do the check only in QEMU. This patch chooses option 2, i.e. KVM omits the memslot and attribute checks, and expects userspace to do the checks. Similar to normal MMIO emulation, try to handle the MMIO in kernel first, if kernel can't support it, forward the request to userspace. Export needed symbols used for MMIO handling. Fragments handling is not needed for TDX PV MMIO because GPA is provided, if a MMIO access crosses page boundary, it should be continuous in GPA. Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed. Allow cross page access because no extra handling needed after checking both start and end GPA are shared GPAs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: Add parameter "kvm" to kvm_cpu_dirty_log_size() and its callers  (Yan Zhao)
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and plumb it down to its callers: kvm_dirty_ring_get_rsvd_entries() and kvm_dirty_ring_alloc(). This is a preparation for making cpu_dirty_log_size a per-VM value rather than a system-wide value. No functional change expected. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: VMX: Initialize TDX during KVM module load  (Kai Huang)
Before KVM can use TDX to create and run TDX guests, TDX needs to be initialized from two perspectives: 1) TDX module must be initialized properly to a working state; 2) A per-cpu TDX initialization, a.k.a the TDH.SYS.LP.INIT SEAMCALL must be done on any logical cpu before it can run any other TDX SEAMCALLs. The TDX host core-kernel provides two functions to do the above two respectively: tdx_enable() and tdx_cpu_enable(). There are two options in terms of when to initialize TDX: initialize TDX at KVM module loading time, or when creating the first TDX guest. Choose to initialize TDX during KVM module loading time: Initializing TDX module is both memory and CPU time consuming: 1) the kernel needs to allocate a non-trivial size(~1/256) of system memory as metadata used by TDX module to track each TDX-usable memory page's status; 2) the TDX module needs to initialize this metadata, one entry for each TDX-usable memory page. Also, the kernel uses alloc_contig_pages() to allocate those metadata chunks, because they are large and need to be physically contiguous. alloc_contig_pages() can fail. If initializing TDX when creating the first TDX guest, then there's chance that KVM won't be able to run any TDX guests albeit KVM _declares_ to be able to support TDX. This isn't good for the user. On the other hand, initializing TDX at KVM module loading time can make sure KVM is providing a consistent view of whether KVM can support TDX to the user. Always only try to initialize TDX after VMX has been initialized. TDX is based on VMX, and if VMX fails to initialize then TDX is likely to be broken anyway. Also, in practice, supporting TDX will require part of VMX and common x86 infrastructure in working order, so TDX cannot be enabled alone w/o VMX support. There are two cases that can result in failure to initialize TDX: 1) TDX cannot be supported (e.g., because of TDX is not supported or enabled by hardware, or module is not loaded, or missing some dependency in KVM's configuration); 2) Any unexpected error during TDX bring-up. For the first case only mark TDX is disabled but still allow KVM module to be loaded. For the second case just fail to load the KVM module so that the user can be aware. Because TDX costs additional memory, don't enable TDX by default. Add a new module parameter 'enable_tdx' to allow the user to opt-in. Note, the name tdx_init() has already been taken by the early boot code. Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM doesn't actually teardown TDX). They don't match vt_init()/vt_exit(), vmx_init()/vmx_exit() etc but it's not end of the world. Also, once initialized, the TDX module cannot be disabled and enabled again w/o the TDX module runtime update, which isn't supported by the kernel. After TDX is enabled, nothing needs to be done when KVM disables hardware virtualization, e.g., when offlining CPU, or during suspend/resume. TDX host core-kernel code internally tracks TDX status and can handle "multiple enabling" scenario. Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX code. Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST because in the longer term there's a use case that requires making SEAMCALLs w/o KVM as mentioned by Dan [1]. 
Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14  KVM: Export hardware virtualization enabling/disabling functions  (Kai Huang)
To support TDX, KVM will need to enable TDX during KVM module loading time. Enabling TDX requires enabling hardware virtualization first so that all online CPUs (and any CPU going online) are in post-VMXON state. KVM enables hardware virtualization by default, but that is done in kvm_init(), which must be the last step after all initialization is done and is thus too late for enabling TDX. Export functions to enable/disable hardware virtualization so that TDX code can use them to handle hardware virtualization enabling before kvm_init(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-ID: <dfe17314c0d9978b7bc3b0833dff6f167fbd28f5.1731664295.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>